Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families
VLMs rely on text recognition, not vision, for spatial tasks, creating a double-digit accuracy gap between text-symbol and purely visual grids.
Researcher Yuval Levental's paper reveals a critical flaw in frontier Vision-Language Models (VLMs). When transcribing 15x15 grids rendered as text symbols (#, .), Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking scored 84-91% accuracy. Given visually identical grids drawn with non-text squares, performance collapsed to 60-73%. The study demonstrates that VLMs route spatial reasoning through a high-fidelity text-recognition pathway rather than native visual processing, exposing a fundamental architectural limitation.
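The experimental setup described above can be sketched in a few lines. This is a hypothetical reconstruction, not the paper's actual code: the function names, the 30% fill probability, and the cell-wise scoring metric are all assumptions chosen for illustration.

```python
import random

def make_grid(n=15, fill_prob=0.3, seed=0):
    """Generate a random n x n boolean occupancy grid (fill_prob is an assumed parameter)."""
    rng = random.Random(seed)
    return [[rng.random() < fill_prob for _ in range(n)] for _ in range(n)]

def as_text(grid):
    """Render the grid with text symbols ('#' filled, '.' empty) -- the
    condition under which the models scored 84-91% accuracy."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)

def cell_accuracy(reference, transcription):
    """Assumed scoring metric: fraction of grid cells transcribed correctly."""
    ref = reference.replace("\n", "")
    hyp = transcription.replace("\n", "")
    if len(hyp) != len(ref):
        return 0.0
    return sum(a == b for a, b in zip(ref, hyp)) / len(ref)

grid = make_grid()
text = as_text(grid)
print(cell_accuracy(text, text))  # a perfect transcription scores 1.0
```

The visual condition would render the same boolean grid as an image of colored squares (e.g. via PIL) instead of characters, so both conditions encode identical spatial information and differ only in modality.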
Why It Matters
This undermines trust in VLMs for real-world visual analysis such as medical imaging, diagram interpretation, and autonomous-system perception, where text labels are absent and models must rely on vision alone.