Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion
When vision models agree, language models follow suit — by a factor of 2.
A team led by Eghbal Hosseini (MIT), Brian Cheung (MIT), Evelina Fedorenko (Harvard/MIT), and Alex Williams (NYU) has published a paper at the ICLR 2026 Workshop on Representational Alignment introducing a methodology for measuring how individual stimuli drive convergence across neural network representations. Using the Generalized Procrustes Algorithm, they quantified intra-modal dispersion, the degree to which vision models (e.g., DINOv2, ResNet) disagree in their representations of a single image, and found that it strongly predicts cross-modal alignment with language models.
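The paper's exact pipeline is not reproduced here, but the core computation can be sketched: rotate each vision model's embedding matrix onto a shared consensus with a rotation-only generalized Procrustes step, then score each image by how far the models' aligned embeddings scatter around that consensus. The function names, the assumption that all models' features have already been projected to a common dimensionality (e.g., via PCA), and the fixed iteration count below are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes


def generalized_procrustes(embeddings, n_iter=10):
    """Rotation-only generalized Procrustes over K model embedding matrices.

    `embeddings`: list of K arrays, each (n_stimuli, d); assumes every model's
    features were already reduced to the same dimensionality d.
    Returns the aligned copies and the consensus configuration.
    """
    aligned = [X - X.mean(axis=0) for X in embeddings]   # center each model
    consensus = np.mean(aligned, axis=0)
    for _ in range(n_iter):
        for k, X in enumerate(aligned):
            R, _ = orthogonal_procrustes(X, consensus)   # best rotation of X onto consensus
            aligned[k] = X @ R
        consensus = np.mean(aligned, axis=0)
    return aligned, consensus


def per_stimulus_dispersion(aligned, consensus):
    """Mean squared distance of each stimulus's aligned embeddings (one per model)
    to the consensus point for that stimulus; low values mean the models agree."""
    diffs = np.stack(aligned) - consensus[None]          # (n_models, n_stimuli, d)
    return (diffs ** 2).sum(axis=-1).mean(axis=0)        # (n_stimuli,)
```

Under these assumptions, each image receives a single dispersion score, which can then be used to sort or bin the stimulus set.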
Specifically, images that elicited high agreement among vision models (low intra-modal dispersion) produced up to 2x higher alignment between vision and language model representations compared to images with high dispersion. This effect held across multiple model pairings and stimulus selection criteria. The findings provide a path to understanding why some inputs produce convergent representations across modalities and architectures, and may help explain how AI systems align with human neural representations. The work offers practical guidance for designing multimodal models that more reliably converge on shared representations.
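One way to illustrate the reported comparison is to split stimuli into low- and high-dispersion halves and measure vision-language alignment separately on each subset. The alignment metric used in the paper is not specified here, so linear CKA serves purely as a stand-in, and the arrays below are random placeholders for real model embeddings and the dispersion scores from the sketch above.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear centered kernel alignment between two (n_stimuli, d) matrices."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    num = np.linalg.norm(Xc.T @ Yc, ord="fro") ** 2
    den = (np.linalg.norm(Xc.T @ Xc, ord="fro")
           * np.linalg.norm(Yc.T @ Yc, ord="fro"))
    return num / den


# Placeholder data; in practice these would be real vision/language embeddings
# and the per-stimulus dispersion scores computed above.
rng = np.random.default_rng(0)
vision_emb = rng.standard_normal((1000, 64))     # e.g., DINOv2 image features
language_emb = rng.standard_normal((1000, 64))   # e.g., caption embeddings
dispersion = rng.random(1000)

# Split stimuli into low- vs. high-dispersion halves and compare alignment.
order = np.argsort(dispersion)
low, high = order[:500], order[500:]
print("low-dispersion CKA: ", linear_cka(vision_emb[low], language_emb[low]))
print("high-dispersion CKA:", linear_cka(vision_emb[high], language_emb[high]))
```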
- Applied the Generalized Procrustes Algorithm to measure intra-modal dispersion at the single-stimulus level across vision models
- Low-dispersion stimuli boosted cross-modal alignment between DINOv2 and language models by up to 2x
- Effect robust across multiple model pairings and stimulus selection criteria
Why It Matters
Explains why certain images drive cross-modal agreement among AI models, offering guidance for designing multimodal systems that converge more reliably on shared representations.