Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality
A new pre-training paradigm restructures attention flow to compress full semantic information into a single token's embedding.
A team of researchers has introduced CoCoA (Content reconstruction via Collaborative Attention), a novel pre-training paradigm designed to overcome a fundamental limitation in multimodal large language models (MLLMs) such as Qwen2-VL. While MLLMs excel at generation, their causal attention and next-token-prediction objective do not inherently produce the globally compact representations needed for high-quality multimodal embeddings, which are crucial for retrieval and classification. CoCoA addresses this by restructuring the model's attention flow and introducing a novel training objective: the model must reconstruct the original input content from the embedding of the <EOS> (end-of-sequence) token, forcing it to compress the full semantic information of an image or text into that single token's representation.
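The core idea can be sketched with a toy example: the hidden state at the final (<EOS>) position serves as the whole sequence's embedding, and a decoder must recover the input tokens from that single vector, so the pooled state is penalized for discarding content. The linear decoder `W_dec` and the bag-of-tokens cross-entropy below are illustrative simplifications, not the paper's actual reconstruction head or attention restructuring.

```python
import numpy as np

rng = np.random.default_rng(0)

def eos_embedding(hidden_states: np.ndarray) -> np.ndarray:
    # EOS pooling: the last position's hidden state is used as the
    # compressed summary of the entire input sequence.
    return hidden_states[-1]

def reconstruction_loss(eos_vec: np.ndarray, input_ids: np.ndarray,
                        W_dec: np.ndarray) -> float:
    # Bag-of-tokens simplification of content reconstruction:
    # a single softmax over the vocabulary, computed from the EOS
    # embedding alone, must explain every input token.
    logits = eos_vec @ W_dec                        # (vocab,)
    logits = logits - logits.max()                  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[input_ids].mean())      # mean NLL of input tokens

# Toy dimensions (hypothetical, far smaller than a real MLLM).
seq_len, d_model, vocab = 6, 16, 100
hidden = rng.standard_normal((seq_len, d_model))    # stand-in encoder states
W_dec = rng.standard_normal((d_model, vocab)) * 0.02
input_ids = rng.integers(0, vocab, size=seq_len)

emb = eos_embedding(hidden)
loss = reconstruction_loss(emb, input_ids, W_dec)
print(emb.shape, loss)
```

Minimizing this loss with respect to the encoder pushes information about every input token into the single EOS vector, which is exactly the compression effect the pre-training objective is after.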
The technical innovation lies in shifting from a purely generative objective to one that explicitly optimizes for representation quality. By using this EOS-based reconstruction task as a pre-training step, CoCoA lays a superior foundation for subsequent contrastive fine-tuning. Experiments on the MMEB-V1 benchmark show that applying CoCoA to models like Qwen2-VL and Qwen2.5-VL significantly boosts their embedding performance. The research demonstrates that content reconstruction is a powerful, data-efficient strategy to 'raise the performance ceiling' of existing multimodal models, maximizing their value for embedding tasks without requiring massive new datasets. This work provides a new architectural pathway to build better backbones for AI systems that need to understand and search across images and text.
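The contrastive fine-tuning stage that follows this pre-training is typically an in-batch InfoNCE objective over the pooled embeddings. The NumPy sketch below assumes generic L2-normalized embedding vectors and a hypothetical temperature of 0.05; it illustrates the standard contrastive recipe, not CoCoA's specific training configuration.

```python
import numpy as np

def info_nce_loss(query_emb: np.ndarray, doc_emb: np.ndarray,
                  temperature: float = 0.05) -> float:
    # Symmetric in-batch contrastive loss: row i of query_emb and row i
    # of doc_emb form the positive pair; all other rows act as negatives.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sims = q @ d.T / temperature                    # (B, B) scaled cosine sims

    def xent(logits: np.ndarray) -> float:
        # Cross-entropy with the matching index (the diagonal) as the label.
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return float(-np.diag(logp).mean())

    return 0.5 * (xent(sims) + xent(sims.T))        # both retrieval directions

rng = np.random.default_rng(1)
B, dim = 4, 8
anchors = rng.standard_normal((B, dim))
positives = anchors + 0.01 * rng.standard_normal((B, dim))  # near-duplicates

loss_matched = info_nce_loss(anchors, positives)
loss_random = info_nce_loss(anchors, rng.standard_normal((B, dim)))
print(loss_matched, loss_random)  # aligned pairs should yield the lower loss
```

The claim in the paragraph above is that a reconstruction-pre-trained backbone starts this contrastive stage from embeddings that already summarize their inputs well, so the same fine-tuning budget reaches a higher ceiling.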
- Introduces CoCoA, a pre-training paradigm using EOS-token content reconstruction to create better multimodal embeddings.
- Restructures attention flow in MLLMs like Qwen2-VL, forcing semantic compression into a single token representation.
- Shown to significantly improve embedding quality on the MMEB-V1 benchmark, offering a data-efficient performance boost.
Why It Matters
Enables more accurate AI for search and recommendation by creating better compressed representations of images and text.