Research & Papers

Study: VLMs don't uniformly beat LLMs in mimicking human reading

Multimodal training doesn't guarantee more human-like text processing, new study finds

Deep Dive

A team of researchers led by Jinzhou Wu compared large language models (LLMs) and vision-language models (VLMs) to test whether multimodal pretraining makes text representations more human-like during natural reading. They used tightly matched model pairs under a strictly text-only setting to isolate the effect of multimodal training history from online visual input. Human alignment was measured using whole-cortex fMRI responses and synchronized eye-tracking saccades from a natural reading dataset.
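
The summary does not say how model-human alignment was scored. One common approach in this line of work is an encoding model: fit a regularized linear map from a model's text representations to the recorded responses, then score how well it predicts held-out data. The sketch below illustrates that general idea only. The function names, the ridge penalty, and the fold count are assumptions chosen for illustration, not the authors' pipeline.

```python
import numpy as np


def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def alignment_score(features, responses, alpha=1.0, n_folds=5):
    """Mean cross-validated Pearson r between predicted and observed responses.

    features:  (n_samples, n_features) model representations (hypothetical input)
    responses: (n_samples, n_targets) e.g. voxel responses (hypothetical input)
    """
    n_samples = features.shape[0]
    folds = np.array_split(np.arange(n_samples), n_folds)
    fold_scores = []
    for test_idx in folds:
        train_mask = np.ones(n_samples, dtype=bool)
        train_mask[test_idx] = False
        X_tr, Y_tr = features[train_mask], responses[train_mask]
        X_te, Y_te = features[test_idx], responses[test_idx]

        # Standardize features with training-set statistics only.
        mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0) + 1e-8
        X_tr, X_te = (X_tr - mu) / sd, (X_te - mu) / sd

        # Fit on centered targets, then add the training mean back.
        y_mu = Y_tr.mean(axis=0)
        W = ridge_fit(X_tr, Y_tr - y_mu, alpha)
        pred = X_te @ W + y_mu

        # Pearson correlation per target, averaged across targets.
        pc, yc = pred - pred.mean(axis=0), Y_te - Y_te.mean(axis=0)
        denom = np.linalg.norm(pc, axis=0) * np.linalg.norm(yc, axis=0) + 1e-12
        fold_scores.append(((pc * yc).sum(axis=0) / denom).mean())
    return float(np.mean(fold_scores))
```

Under this setup, a matched LLM/VLM pair would be compared by calling `alignment_score` once per model on the same responses and taking the difference, overall or within subsets of sentences grouped by visual semantic content.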

Results show that multimodal pretraining does not confer a uniform, global advantage. VLMs outperformed LLMs only on sentences with stronger visual semantic content, an effect visible in both brain activity and eye-movement patterns. The findings challenge the assumption that adding vision to language models automatically improves their alignment with human cognition during reading.

Key Points
  • VLMs showed no global advantage over LLMs in aligning with human brain activity during natural reading
  • Selective VLM advantage emerged only for sentences with strong visual semantic content
  • Study used whole-cortex fMRI and eye-tracking data to compare model-human alignment

Why It Matters

The study challenges the assumption that multimodal models are inherently more human-like, a result that can inform how models are designed for language understanding.