COMET framework narrows the audio-text modality gap without costly retraining
New spectral truncation method boosts zero-shot audio captioning to near-supervised levels.
Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding, but they suffer from a persistent modality gap between audio and text embeddings. Existing explanations attribute this gap to a mean shift (the cone effect), yet correcting the mean alone yields limited gains. In a new paper, Zhu et al. introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a framework based on partial least squares with singular value decomposition (PLS-SVD) that systematically dissects CLAP embeddings. Their analysis reveals that only a small, interpretable subset of axes, those capturing shared concepts, contributes substantially to similarity computations, and that the mean component accounts for only part of the gap.
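To make the PLS-SVD idea concrete, here is a minimal sketch of the textbook technique: center paired audio and text embeddings, form their cross-covariance matrix, and take its SVD. This is not the authors' implementation. The function name pls_svd, the synthetic data, and every parameter value are assumptions for illustration only, and the numbers it prints say nothing about real CLAP embeddings.

```python
import numpy as np

def pls_svd(audio, text):
    """Generic PLS-SVD on paired embeddings (illustrative, not the paper's code).

    audio: (n, d_a) array, text: (n, d_t) array; row i of each forms one pair.
    Returns left axes U, singular values s, right axes V, and both means.
    """
    audio_mean = audio.mean(axis=0)
    text_mean = text.mean(axis=0)
    a = audio - audio_mean              # remove the per-modality mean
    t = text - text_mean
    cross_cov = a.T @ t / (audio.shape[0] - 1)
    u, s, vt = np.linalg.svd(cross_cov, full_matrices=False)
    return u, s, vt.T, audio_mean, text_mean

# Synthetic stand-in for CLAP embeddings (assumption): k shared "concept"
# factors, modality-specific noise, and opposite constant offsets that
# imitate a mean shift between the two modalities.
rng = np.random.default_rng(0)
n, d, k = 500, 32, 4
shared = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
audio = shared @ mixing + 0.3 * rng.normal(size=(n, d)) + 2.0
text = shared @ mixing + 0.3 * rng.normal(size=(n, d)) - 2.0

u, s, v, audio_mean, text_mean = pls_svd(audio, text)
energy = np.cumsum(s**2) / np.sum(s**2)  # cumulative share of squared covariance
print("share of squared covariance in the top", k, "axes:", float(energy[k - 1]))
print("norm of the mean difference:", float(np.linalg.norm(audio_mean - text_mean)))
```

In this toy setup the singular values fall off sharply after the first k axes, which mirrors the kind of picture the article describes: a few axes carry the shared structure, and the mean difference is handled as a separate component.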
Building on this insight, the authors propose a simple, training-free spectral truncation method that removes noise from irrelevant axes. This approach effectively mitigates the modality gap without requiring large auxiliary memory banks or expensive computation. In zero-shot audio captioning tasks with condition swapping, the method approaches fully supervised performance. It also achieves substantial embedding dimensionality reduction while preserving strong results on retrieval and captioning benchmarks. The work provides both a theoretical understanding of the modality gap and a practical solution for deploying CLAP models more efficiently.
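A training-free truncation of this kind can be sketched as follows, assuming it amounts to keeping the top-k PLS-SVD axes and projecting centered embeddings onto them. The source does not spell out the authors' exact procedure, so treat this as one plausible reading. The helpers fit_truncation, project, and mean_paired_cosine, the value of k, and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_truncation(audio, text, k):
    """Estimate the top-k PLS-SVD axes from paired embeddings (illustrative)."""
    a_mean, t_mean = audio.mean(axis=0), text.mean(axis=0)
    cross_cov = (audio - a_mean).T @ (text - t_mean) / (audio.shape[0] - 1)
    u, s, vt = np.linalg.svd(cross_cov, full_matrices=False)
    return u[:, :k], vt[:k].T, a_mean, t_mean

def project(x, basis, mean):
    """Center x and keep only its coordinates along the retained axes."""
    return (x - mean) @ basis

def mean_paired_cosine(x, y):
    """Average cosine similarity between row i of x and row i of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return float(np.mean(np.sum(x * y, axis=1)))

# Synthetic paired data (assumption): shared factors, noise, and a mean shift.
rng = np.random.default_rng(0)
n, d, k = 500, 32, 4
shared = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
audio = shared @ mixing + 0.3 * rng.normal(size=(n, d)) + 2.0
text = shared @ mixing + 0.3 * rng.normal(size=(n, d)) - 2.0

# Fit the axes on 400 pairs and evaluate on the 100 held-out pairs.
u_k, v_k, a_mean, t_mean = fit_truncation(audio[:400], text[:400], k)
z_audio = project(audio[400:], u_k, a_mean)
z_text = project(text[400:], v_k, t_mean)

before = mean_paired_cosine(audio[400:], text[400:])
after = mean_paired_cosine(z_audio, z_text)
print("paired cosine, raw embeddings:", before)
print("paired cosine, truncated", k, "dim embeddings:", after)
```

The only fitted quantities are two mean vectors and two d-by-k projection matrices, so nothing is retrained and no memory bank is stored. The projected embeddings are also k-dimensional rather than d-dimensional, which is the sense in which truncation doubles as dimensionality reduction in this sketch.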