Instruction-Tuned MLLMs Align Closer to Human Brain Activity Than Unimodal Models
IT-MLLMs predict fMRI responses 9% better than ICL models during movie watching.
A new arXiv preprint (2506.08277) by Oota et al. probes how instruction-tuned multimodal large language models (IT-MLLMs) align with human brain activity. Using fMRI recordings from participants watching naturalistic movie clips (video with audio), the team extracted representations from six video and two audio IT-MLLMs under 13 task instructions. They then measured brain alignment, defined as how well model representations predict voxel-wise fMRI responses, and compared it against several baselines: in-context learning (ICL) multimodal models, non-instruction-tuned multimodal models, and unimodal models (text-only or vision-only).
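To make "brain alignment" concrete, here is a minimal sketch of a voxel-wise encoding analysis. It assumes a ridge-regression mapping from model features to fMRI responses, scored by Pearson correlation on held-out time points. That is a common setup for this kind of study, but this summary does not confirm the authors' exact pipeline. The names `features`, `fmri`, `alpha`, and `n_train` are hypothetical, and the data are synthetic.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def voxelwise_pearson(Y_true, Y_pred):
    """Pearson r between measured and predicted responses, one value per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

def brain_alignment(features, fmri, n_train, alpha=1.0):
    """Fit on the first n_train time points, score on the remainder."""
    X_train, X_test = features[:n_train], features[n_train:]
    Y_train, Y_test = fmri[:n_train], fmri[n_train:]
    # Standardise features with training statistics only, to avoid leakage.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + 1e-8
    X_train, X_test = (X_train - mu) / sd, (X_test - mu) / sd
    # Centre the responses so no explicit intercept column is needed.
    y_mu = Y_train.mean(axis=0)
    W = fit_ridge(X_train, Y_train - y_mu, alpha)
    Y_pred = X_test @ W + y_mu
    return voxelwise_pearson(Y_test, Y_pred)

# Synthetic stand-ins (assumption): 400 time points, 20 features, 50 voxels.
rng = np.random.default_rng(0)
n_time, n_feat, n_vox = 400, 20, 50
features = rng.standard_normal((n_time, n_feat))
true_W = rng.standard_normal((n_feat, n_vox))
fmri = features @ true_W + 0.5 * rng.standard_normal((n_time, n_vox))

scores = brain_alignment(features, fmri, n_train=300)
mean_alignment = float(scores.mean())
```

Comparing model families then amounts to repeating this with each model's features and comparing the resulting per-voxel scores. In practice `alpha` would usually be chosen by cross-validation rather than fixed.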
The results show a clear hierarchy. IT-MLLMs achieved the highest brain alignment: about 9% better than ICL models, about 15% better than non-instruction-tuned multimodal models, and about 20% better than unimodal baselines. ICL models exhibited strong semantic organization (Pearson r=0.78 with instruction-text embeddings), while IT models showed only weak coupling to instruction semantics (r=0.14). This dissociation suggests that instruction tuning creates task-conditioned subspaces in the model's representational space, and that these subspaces align more closely with how the brain processes naturalistic stimuli. The findings open new avenues for mapping joint information processing between AI and biological neural systems.
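One way to read the r=0.78 versus r=0.14 contrast is as a similarity-structure comparison: do pairs of instructions that are close in text-embedding space also yield similar model representations? The sketch below illustrates that idea by correlating pairwise cosine similarities. It is an assumption about how such a coupling score could be computed, not the authors' documented procedure. The function names and the synthetic `coupled` and `uncoupled` representations are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(M):
    """Pairwise cosine similarity between the rows of M."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

def semantic_coupling(model_reps, text_embs):
    """Pearson r between the two similarity structures (upper triangle only)."""
    iu = np.triu_indices(model_reps.shape[0], k=1)
    a = cosine_similarity_matrix(model_reps)[iu]
    b = cosine_similarity_matrix(text_embs)[iu]
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic stand-ins (assumption): 13 instructions, 32-dimensional vectors.
rng = np.random.default_rng(0)
n_instructions, dim = 13, 32
text_embs = rng.standard_normal((n_instructions, dim))

# "Coupled" representations: a rotated, lightly noised copy of the text
# embeddings, so pairwise geometry is almost fully preserved.
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
coupled = text_embs @ Q + 0.05 * rng.standard_normal((n_instructions, dim))

# "Uncoupled" representations: unrelated to the instruction text.
uncoupled = rng.standard_normal((n_instructions, dim))

r_coupled = semantic_coupling(coupled, text_embs)
r_uncoupled = semantic_coupling(uncoupled, text_embs)
```

Under this reading, a high score means the model's representations track instruction wording, while a low score means they are organized along some other axis, such as the task itself.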
- IT-MLLMs achieved roughly 9% higher brain alignment than ICL models and roughly 20% higher than unimodal baselines.
- ICL models show strong semantic organization (r=0.78) while IT models show weak coupling to instruction semantics (r=0.14).
- Task-specific instructions yield distinct representations that map differently across brain regions, pointing toward more brain-aligned AI.
Why It Matters
This research suggests that instruction tuning brings multimodal model representations closer to brain responses during naturalistic viewing, which could guide the development of more biologically plausible AI systems.