Multimodal Data Curation Through Ranked Retrieval
Modality gap cut by over 90% with smarter training pairs and embedding fusion
A new paper from Pratyush Muthukumar and six co-authors tackles two persistent problems in multimodal embeddings: the embeddings often encode modality (e.g., image vs. text) more strongly than meaning, and the paired supervision used to train them is noisy. To address the noise, they introduce Symmetric Nucleus Subsampling (SNS), which trims raw inputs and annotations to the portions that best support each other, cleaning up training pairs. To address the gap, they propose the Expert Embedding Engine (EEE), a learned projection network that fuses complementary embedding experts under a bias-aware objective, actively reducing modality-driven separation in the embedding space.
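To make the SNS idea concrete: the paper's exact procedure isn't reproduced here, but a minimal sketch of nucleus-style pair trimming, assuming each side of a pair decomposes into segments and a cross-modal encoder scores how well segments support each other, might look like the following. The names `embed_segments`, `symmetric_nucleus_subsample`, and the `top_p` cutoff are illustrative assumptions, not the authors' API.

```python
import numpy as np

def embed_segments(segments):
    # Stand-in for a real cross-modal encoder (e.g., CLIP-style);
    # segments are hashed into fixed unit vectors so the sketch runs.
    vecs = [np.random.default_rng(abs(hash(s)) % 2**32).normal(size=64)
            for s in segments]
    return np.stack([v / np.linalg.norm(v) for v in vecs])

def symmetric_nucleus_subsample(a_segments, b_segments, top_p=0.8):
    """Trim a noisy pair to its mutually supportive core (hypothetical).

    Each segment on side A is scored by its best match on side B (and
    vice versa); the smallest set of best-supported segments whose
    scores cover top_p of the total mass is kept on each side.
    """
    sim = embed_segments(a_segments) @ embed_segments(b_segments).T

    def nucleus(scores, items):
        scores = np.clip(scores, 0, None)            # ignore anti-support
        order = np.argsort(scores)[::-1]             # best-supported first
        mass = np.cumsum(scores[order]) / (scores.sum() + 1e-9)
        cut = int(np.searchsorted(mass, top_p)) + 1  # smallest covering set
        return [items[i] for i in sorted(order[:cut])]

    return (nucleus(sim.max(axis=1), a_segments),    # A scored against B
            nucleus(sim.max(axis=0), b_segments))    # B scored against A

caption = ["a dog catches a frisbee", "posted by user123", "great weather"]
regions = ["dog region", "frisbee region", "watermark region"]
print(symmetric_nucleus_subsample(caption, regions))
```

The "symmetric" part is that both sides are filtered against each other, so a caption sentence survives only if some image region supports it and vice versa; the real SNS may score and trim differently.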
The results are striking: the framework collapses the modality gap by over 90% compared to base embedding experts. When used for data curation, datablends created with SNS+EEE outperformed stratified sampling and traditional curation baselines in downstream model performance. The work was accepted at ICLR DATA-FM 2026 and is available on arXiv. For AI teams building multimodal systems, this provides a practical recipe for cleaning training data and improving cross-modal retrieval without requiring massive new datasets.
- Symmetric Nucleus Subsampling (SNS) refines training pairs by trimming to mutually supportive portions, reducing noise from heterogeneous annotations.
- Expert Embedding Engine (EEE) combines multiple embedding experts via a learned projection trained with a bias-aware objective to minimize modality-driven separation (see the sketch after this list).
- Achieves an over-90% reduction in the modality gap and produces curation datablends that beat stratified sampling in downstream performance.
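On the EEE side, a minimal PyTorch sketch, assuming the bias-aware objective pairs a standard symmetric InfoNCE alignment term with an explicit penalty on the distance between modality centroids (a common proxy for the modality gap), could look like this. `FusionProjection`, `bias_aware_loss`, and `gap_weight` are hypothetical names, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def modality_gap(img_emb, txt_emb):
    # Proxy for the modality gap: distance between the two modality
    # centroids after normalizing embeddings to the unit sphere.
    return (F.normalize(img_emb, dim=-1).mean(0)
            - F.normalize(txt_emb, dim=-1).mean(0)).norm().item()

class FusionProjection(nn.Module):
    """Hypothetical EEE-style projection: concatenate the outputs of
    several frozen embedding experts and map them to one shared space."""
    def __init__(self, expert_dims, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(sum(expert_dims), out_dim)

    def forward(self, expert_outputs):  # list of (batch, d_i) tensors
        return F.normalize(self.proj(torch.cat(expert_outputs, dim=-1)), dim=-1)

def bias_aware_loss(img_z, txt_z, temperature=0.07, gap_weight=1.0):
    # Standard symmetric InfoNCE term aligns matched image/text pairs...
    logits = img_z @ txt_z.T / temperature
    labels = torch.arange(len(img_z), device=img_z.device)
    align = (F.cross_entropy(logits, labels)
             + F.cross_entropy(logits.T, labels)) / 2
    # ...while an explicit centroid-gap penalty discourages the two
    # modalities from settling into separate regions of the space.
    gap = (img_z.mean(0) - txt_z.mean(0)).norm()
    return align + gap_weight * gap

# Toy usage with two fictional experts of width 512 and 768.
fuse = FusionProjection(expert_dims=[512, 768])
img_z = fuse([torch.randn(32, 512), torch.randn(32, 768)])
txt_z = fuse([torch.randn(32, 512), torch.randn(32, 768)])
print(modality_gap(img_z, txt_z), bias_aware_loss(img_z, txt_z).item())
```

In this reading, `gap_weight` trades pure retrieval alignment against modality mixing; the paper's actual objective may take a different form.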
Why It Matters
A data-curation recipe that slashes cross-modal confusion, directly improving retrieval and model training efficiency for multimodal AI.