Research & Papers

[R] Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families

VLMs score ~84% F1 reading grids rendered as text characters but collapse to 29-39% F1 when shown the identical grids rendered as filled squares.

Deep Dive

A viral research experiment reveals a critical weakness in top Vision-Language Models (VLMs) from Anthropic, OpenAI, and Google. Asked to transcribe 15×15 binary grids, Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking achieved ~84% F1 when the grids were rendered as images of text characters (. and #). Performance collapsed to 29-39% F1, a drop of 34-54 points, when the identical grids were rendered as filled squares, despite both conditions passing through the same visual encoder. Each model failed differently: Claude under-counted filled cells, ChatGPT over-counted them, and Gemini hallucinated L-shaped patterns. The findings indicate that current VLMs possess strong implicit OCR pipelines but lack equivalent mechanisms for processing non-textual spatial features, which limits their ability to interpret charts, diagrams, and other structured visual content.
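For readers who want to probe this themselves, here is a minimal sketch of the two stimulus conditions in Python, using Pillow for the image rendering. The 15×15 grid size and the ./# text encoding come from the experiment as described; the cell size in pixels, the fill density, and all function names are our assumptions, not the authors' code.

```python
import random

from PIL import Image, ImageDraw

GRID = 15      # 15x15 grids, as in the experiment
CELL = 32      # pixels per cell (our choice; not specified in the post)
DENSITY = 0.3  # fraction of filled cells (assumed parameter)


def random_grid(density: float, seed: int = 0) -> list[list[int]]:
    """Generate a random binary grid with roughly the given fill density."""
    rng = random.Random(seed)
    return [[int(rng.random() < density) for _ in range(GRID)] for _ in range(GRID)]


def as_text(grid: list[list[int]]) -> str:
    """Text-character condition: '.' for empty cells, '#' for filled cells."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)


def as_image(grid: list[list[int]]) -> Image.Image:
    """Filled-square condition: each filled cell becomes a black square."""
    img = Image.new("RGB", (GRID * CELL, GRID * CELL), "white")
    draw = ImageDraw.Draw(img)
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell:
                draw.rectangle(
                    [c * CELL, r * CELL, (c + 1) * CELL - 1, (r + 1) * CELL - 1],
                    fill="black",
                )
    return img


if __name__ == "__main__":
    grid = random_grid(DENSITY)
    print(as_text(grid))             # stimulus for the text condition
    as_image(grid).save("grid.png")  # stimulus for the image condition
```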

Key Points
  • Performance gap of 34-54 F1 points: VLMs scored ~84% on text-character grids but only 29-39% on the identical grids rendered as filled squares (a scoring sketch follows this list).
  • Model-specific failure patterns: Claude Opus under-counted, ChatGPT 5.2 over-counted, and Gemini 3 Thinking hallucinated structured L-shaped templates.
  • Gemini showed the strongest performance on the image condition at low density (68% F1) but collapsed completely above 32% grid density, with systematic hallucinations.
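The F1 figures above are cell-level scores over the transcribed grid, with filled cells as the positive class. The sketch below shows one way to compute that score and the grid-density figure cited in the last bullet; this is our reading of the metric, not the authors' evaluation code.

```python
def cell_f1(pred: list[list[int]], true: list[list[int]]) -> float:
    """Cell-level F1, treating filled cells (1s) as the positive class
    (our assumed metric definition)."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred, true):
        for p, t in zip(pred_row, true_row):
            tp += int(p == 1 and t == 1)
            fp += int(p == 1 and t == 0)
            fn += int(p == 0 and t == 1)
    # Perfect score when both grids are empty; otherwise standard F1.
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0


def density(grid: list[list[int]]) -> float:
    """Fraction of filled cells, e.g. the 32% threshold cited above."""
    cells = [c for row in grid for c in row]
    return sum(cells) / len(cells)
```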

Why It Matters

This weakness affects real-world applications that rely on charts, spreadsheets, and diagrams, pointing to a fundamental gap between VLMs' strong text recognition and their non-textual visual understanding.