Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families
VLMs rely on text recognition, not vision, for spatial tasks, creating a double-digit accuracy gap between text-symbol and purely visual grids.
Researcher Yuval Levental's paper reveals a critical flaw in frontier Vision-Language Models (VLMs). When transcribing 15x15 grids rendered as text symbols (#, .), Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking scored 84-91% accuracy. Given visually identical grids drawn with non-text squares, performance collapsed to 60-73%. The study demonstrates that VLMs route spatial reasoning through a high-fidelity text-recognition pathway rather than native visual processing, exposing a fundamental architectural limitation.
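The experimental setup described above can be sketched in a few lines. This is a hypothetical reconstruction, not the paper's actual code: the function names, the 30% fill probability, and the cell-wise scoring metric are all assumptions chosen for illustration.

```python
import random

def make_grid(n=15, fill_prob=0.3, seed=0):
    """Generate a random n x n boolean occupancy grid (fill_prob is an assumed parameter)."""
    rng = random.Random(seed)
    return [[rng.random() < fill_prob for _ in range(n)] for _ in range(n)]

def as_text(grid):
    """Render the grid with text symbols ('#' filled, '.' empty) -- the
    condition under which the models scored 84-91% accuracy."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)

def cell_accuracy(reference, transcription):
    """Assumed scoring metric: fraction of grid cells transcribed correctly."""
    ref = reference.replace("\n", "")
    hyp = transcription.replace("\n", "")
    if len(hyp) != len(ref):
        return 0.0
    return sum(a == b for a, b in zip(ref, hyp)) / len(ref)

grid = make_grid()
text = as_text(grid)
print(cell_accuracy(text, text))  # a perfect transcription scores 1.0
```

The visual condition would render the same boolean grid as an image of colored squares (e.g. via PIL) instead of characters, so both conditions encode identical spatial information and differ only in modality.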
Why It Matters
This undermines trust in VLMs for real-world visual analysis such as medical imaging, diagram interpretation, and autonomous-system perception, where text labels are absent and models must rely on vision alone.