CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
New method corrects CLIP's tendency to ignore details after the first sentence of a caption, improving long-text retrieval.
A team from the University of Toronto and Vector Institute has published a paper revealing a fundamental flaw in how CLIP (Contrastive Language-Image Pre-training) models process information. The research, titled 'CLIP Is Shortsighted: Paying Attention Beyond the First Sentence,' identifies that CLIP's training on internet data, dominated by short captions, has biased it to encode only simple object descriptions. More critically, when fine-tuned on longer captions—whether human- or LLM-generated—the model learns a shortcut: it concentrates its attention almost exclusively on the opening summary sentence, effectively ignoring the detailed description that follows. This 'first-sentence bias' leads to poor alignment for complex scenes and weak performance on long-text retrieval tasks, limiting the capabilities of the many downstream applications that rely on CLIP as a vision encoder.
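The bias is easy to probe informally. The sketch below is an illustrative diagnostic, not the paper's measurement code: it runs a stock CLIP text encoder from Hugging Face Transformers on a multi-sentence caption and sums the final-layer attention that the pooled EOS position pays to each sentence. Under the reported bias, a long-caption fine-tune would concentrate this mass on the opening sentence. The model name, sentence splitting, and per-sentence aggregation are assumptions made for illustration.

```python
import re
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Illustrative probe: how much final-layer attention does the pooled (EOS)
# position pay to each sentence? Stock OpenAI CLIP is used here; the paper's
# own models and measurement protocol may differ.
name = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(name)
model = CLIPTextModel.from_pretrained(name).eval()

caption = ("A busy farmers market on a sunny morning. "
           "Vendors sell heirloom tomatoes under striped awnings. "
           "A dog sleeps beside a crate of oranges.")
sentences = re.split(r"(?<=\.)\s+", caption)

inputs = tokenizer(caption, return_tensors="pt", truncation=True)
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Final-layer attention from the EOS position (CLIP pools its text embedding
# there), averaged over heads: shape (seq_len,).
eos_pos = (inputs.input_ids[0] == tokenizer.eos_token_id).nonzero()[0].item()
eos_attn = out.attentions[-1][0, :, eos_pos, :].mean(dim=0)

# Map token positions back to sentences (approximate: each sentence is
# re-tokenized separately; position 0 is the BOS token).
pos = 1
for sent in sentences:
    n = len(tokenizer(sent, add_special_tokens=False).input_ids)
    print(f"{eos_attn[pos:pos + n].sum().item():.3f}  <- {sent}")
    pos += n
```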
The researchers' solution, DeBias-CLIP, is a training methodology designed to spread the model's attention across the entire caption. During training, the opening summary sentence is removed and sentence sub-sampling and text-token padding are applied, forcing the model to learn from every part of the text rather than just its beginning. The result is a model that achieves state-of-the-art performance on long-text retrieval benchmarks, improves short-text retrieval as well, and is more robust to changes in sentence order. Crucially, DeBias-CLIP is a drop-in replacement for existing long-caption CLIP variants such as Long-CLIP and requires no new trainable parameters, so it can be slotted in to upgrade the multi-modal understanding of systems ranging from text-to-image generators to large vision-language models (VLMs).
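The paper's exact recipe is not reproduced here, but a minimal sketch of what such an augmentation could look like follows, assuming 'removing the initial summary sentence,' 'sentence sub-sampling,' and 'text token padding' work roughly as their names suggest. The function name, keep probability, and context length are illustrative placeholders.

```python
import random
from transformers import CLIPTokenizer

def debias_caption(caption: str, drop_first: bool = True,
                   keep_prob: float = 0.7) -> str:
    """Illustrative DeBias-CLIP-style augmentation (parameters assumed):
    drop the opening summary sentence, then randomly sub-sample the rest
    so no fixed position in the caption is reliably informative."""
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    if drop_first and len(sentences) > 1:
        sentences = sentences[1:]          # remove the summary sentence
    kept = [s for s in sentences if random.random() < keep_prob]
    if not kept:                           # always keep at least one sentence
        kept = [random.choice(sentences)]
    return ". ".join(kept) + "."

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
caption = ("A kitchen scene. Copper pots hang above a marble island. "
           "Morning light falls through a round window.")

# Token padding: pad every caption to the full context length so sequence
# length itself carries no signal (77 here; long-caption variants use more).
batch = tokenizer(debias_caption(caption), padding="max_length",
                  max_length=77, truncation=True, return_tensors="pt")
print(batch.input_ids.shape)  # torch.Size([1, 77])
```

The intuition behind this kind of augmentation: with the summary sentence gone and the remaining sentences randomly thinned, the encoder cannot rely on any single position and must attend to whatever detail sentences survive.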
- Identifies 'first-sentence bias' where CLIP models ignore text after an opening summary, weakening alignment for detailed descriptions.
- DeBias-CLIP method removes summary sentences and uses token padding, achieving SOTA long-text retrieval with no extra parameters.
- Acts as a drop-in replacement for Long-CLIP, offering immediate upgrades for diffusion models and vision-language assistants (see the usage sketch below).
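Because no new trainable parameters are introduced, adopting the model should look like any other CLIP checkpoint swap. The sketch below assumes weights released in a standard Transformers-compatible format; the checkpoint path is a hypothetical placeholder, not a real model ID.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder path: assumes the authors publish DeBias-CLIP weights in a
# standard CLIP/Long-CLIP-compatible layout. Since no new parameters are
# added, any pipeline that loads a CLIP checkpoint can load this one instead.
ckpt = "path/to/debias-clip"  # hypothetical; substitute the real release
model = CLIPModel.from_pretrained(ckpt)
processor = CLIPProcessor.from_pretrained(ckpt)

image = Image.new("RGB", (224, 224))  # stand-in for a real photo
long_caption = ("A crowded train platform at dusk. Commuters in raincoats "
                "queue by the yellow line. A violinist plays near a pillar.")

inputs = processor(text=[long_caption], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    sim = model(**inputs).logits_per_image  # image-text similarity logits
print(sim)
```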
Why It Matters
Fixes a core weakness in a foundational vision-language model, leading to more accurate image generation and better understanding of complex scenes.