Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion
When vision models agree, language models follow suit — by a factor of 2.
A team led by Eghbal Hosseini (MIT), Brian Cheung (MIT), Evelina Fedorenko (Harvard/MIT), and Alex Williams (NYU) has published a paper at the ICLR 2026 Workshop on Representational Alignment introducing a methodology for measuring how individual stimuli drive convergence across neural network representations. Using the Generalized Procrustes Algorithm, they quantified intra-modal dispersion, the degree to which vision models (e.g., DINOv2, ResNet) disagree in their representations of a single image, and found that it strongly predicts cross-modal alignment with language models.
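The paper's exact pipeline is not reproduced here, but the core computation can be sketched: rotate each vision model's embedding matrix onto a shared consensus with a rotation-only generalized Procrustes step, then score each image by how far the models' aligned embeddings scatter around that consensus. The function names, the assumption that all models' features have already been projected to a common dimensionality (e.g., via PCA), and the fixed iteration count below are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes


def generalized_procrustes(embeddings, n_iter=10):
    """Rotation-only generalized Procrustes over K model embedding matrices.

    `embeddings`: list of K arrays, each (n_stimuli, d); assumes every model's
    features were already reduced to the same dimensionality d.
    Returns the aligned copies and the consensus configuration.
    """
    aligned = [X - X.mean(axis=0) for X in embeddings]   # center each model
    consensus = np.mean(aligned, axis=0)
    for _ in range(n_iter):
        for k, X in enumerate(aligned):
            R, _ = orthogonal_procrustes(X, consensus)   # best rotation of X onto consensus
            aligned[k] = X @ R
        consensus = np.mean(aligned, axis=0)
    return aligned, consensus


def per_stimulus_dispersion(aligned, consensus):
    """Mean squared distance of each stimulus's aligned embeddings (one per model)
    to the consensus point for that stimulus; low values mean the models agree."""
    diffs = np.stack(aligned) - consensus[None]          # (n_models, n_stimuli, d)
    return (diffs ** 2).sum(axis=-1).mean(axis=0)        # (n_stimuli,)
```

Under these assumptions, each image receives a single dispersion score, which can then be used to sort or bin the stimulus set.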
Specifically, images that elicited high agreement among vision models (low intra-modal dispersion) produced up to 2x higher alignment between vision and language model representations compared to images with high dispersion. This effect held across multiple model pairings and stimulus selection criteria. The findings provide a path to understanding why some inputs produce convergent representations across modalities and architectures, and may help explain how AI systems align with human neural representations. The work offers practical guidance for designing multimodal models that more reliably converge on shared representations.
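One way to illustrate the reported comparison is to split stimuli into low- and high-dispersion halves and measure vision-language alignment separately on each subset. The alignment metric used in the paper is not specified here, so linear CKA serves purely as a stand-in, and the arrays below are random placeholders for real model embeddings and the dispersion scores from the sketch above.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear centered kernel alignment between two (n_stimuli, d) matrices."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    num = np.linalg.norm(Xc.T @ Yc, ord="fro") ** 2
    den = (np.linalg.norm(Xc.T @ Xc, ord="fro")
           * np.linalg.norm(Yc.T @ Yc, ord="fro"))
    return num / den


# Placeholder data; in practice these would be real vision/language embeddings
# and the per-stimulus dispersion scores computed above.
rng = np.random.default_rng(0)
vision_emb = rng.standard_normal((1000, 64))     # e.g., DINOv2 image features
language_emb = rng.standard_normal((1000, 64))   # e.g., caption embeddings
dispersion = rng.random(1000)

# Split stimuli into low- vs. high-dispersion halves and compare alignment.
order = np.argsort(dispersion)
low, high = order[:500], order[500:]
print("low-dispersion CKA: ", linear_cka(vision_emb[low], language_emb[low]))
print("high-dispersion CKA:", linear_cka(vision_emb[high], language_emb[high]))
```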
- Applied the Generalized Procrustes Algorithm to measure intra-modal dispersion at the single-stimulus level across vision models
- Low-dispersion stimuli boosted cross-modal alignment between DINOv2 and language models by up to 2x
- Effect robust across multiple model pairings and stimulus selection criteria
Why It Matters
Explains why certain images drive cross-modal agreement among AI models, offering guidance for designing multimodal systems that converge more reliably on shared representations.