Study: VLMs don't uniformly beat LLMs in mimicking human reading
Multimodal training doesn't guarantee more human-like text processing, new study finds
A team of researchers led by Jinzhou Wu compared large language models (LLMs) and vision-language models (VLMs) to test whether multimodal pretraining makes text representations more human-like during natural reading. They used tightly matched model pairs in a strictly text-only setting, so that any difference reflects a model's multimodal training history rather than visual input at inference time. Human alignment was measured against whole-cortex fMRI responses and synchronized eye-tracking saccades from a natural reading dataset.
The results show that multimodal pretraining does not confer a uniform, global advantage. VLMs outperformed LLMs only when sentences contained stronger visual semantic content, and this effect appeared in both brain activity and eye movement patterns. The findings challenge the assumption that adding vision to language models automatically improves their alignment with human cognition in reading tasks.
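To make the distinction between a global and a selective advantage concrete, here is a minimal sketch of how such a comparison could be set up. It is not the study's actual pipeline: the function name, the threshold, and every number below are hypothetical, and the scores stand in for whatever per-sentence model-human alignment measure is used.

```python
from statistics import mean


def advantage_by_visual_content(llm_scores, vlm_scores, visual_ratings, threshold):
    """Mean (VLM - LLM) alignment gap, overall and split by visual content.

    All inputs are per-sentence lists of equal length. `threshold` is a
    hypothetical cutoff separating high from low visual semantic content.
    """
    gaps = [v - l for v, l in zip(vlm_scores, llm_scores)]
    high = [g for g, r in zip(gaps, visual_ratings) if r >= threshold]
    low = [g for g, r in zip(gaps, visual_ratings) if r < threshold]
    return {
        "overall": mean(gaps),
        "high_visual": mean(high) if high else None,
        "low_visual": mean(low) if low else None,
    }


# Made-up toy values for illustration only, not data from the study.
llm = [0.30, 0.32, 0.28, 0.31]
vlm = [0.30, 0.31, 0.34, 0.37]
ratings = [1.0, 2.0, 4.0, 5.0]  # assumed visual-content rating per sentence

result = advantage_by_visual_content(llm, vlm, ratings, threshold=3.0)
print(result)
```

In this toy case the gap is concentrated in the high-visual sentences and is close to zero elsewhere, which is the kind of pattern a single overall average would hide.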
- VLMs showed no global advantage over LLMs in aligning with human brain activity during natural reading
- Selective VLM advantage emerged only for sentences with strong visual semantic content
- Study used whole-cortex fMRI and eye-tracking data to compare model-human alignment
Why It Matters
The study challenges the assumption that multimodal models are inherently more human-like, a result that can inform how models for language understanding are designed.