
Study: Vision-Language Models Produce Consistent but Wrong Spatial Answers

Across 176 object-pair tracks, VLMs give consistent but inaccurate answers to distance queries.

Deep Dive

A new paper from researchers S Divakar Bhat and Toshihiko Yamasaki (University of Tokyo) calls into question a common assumption in vision-language model (VLM) evaluation: that consistent predictions across different viewpoints reflect robust geometric grounding. The authors introduce ViewDiag, a controlled multi-view evaluation protocol built from the Hypersim, ScanNet, and KITTI360 datasets. It comprises 176 object-pair tracks across 80 scenes, with 2–10 views per track. The protocol evaluates models along three axes: metric accuracy of distance predictions, distributional concentration of those predictions, and a latent feature probe that distinguishes decision collapse (predictions become invariant regardless of viewpoint) from representation collapse (internal features stop distinguishing inputs).
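The first two axes can be illustrated with a minimal sketch. The summary above does not state the paper's exact formulas, so the choices here are assumptions: mean relative error stands in for metric accuracy, and the coefficient of variation (standard deviation divided by mean) stands in for distributional concentration. The function names and the per-track input layout are hypothetical and are not taken from the ViewDiag code.

```python
from statistics import mean, pstdev


def mean_relative_error(predictions, ground_truths):
    """Mean of |prediction - ground truth| / ground truth over one track's views."""
    if not predictions or len(predictions) != len(ground_truths):
        raise ValueError("need one ground-truth distance per predicted distance")
    return mean(abs(p - g) / g for p, g in zip(predictions, ground_truths))


def coefficient_of_variation(predictions):
    """Population standard deviation divided by the mean.

    Lower values mean the predictions are more tightly concentrated
    across views. This is an assumed stand-in for the paper's measure.
    """
    centre = mean(predictions)
    if centre == 0:
        raise ValueError("mean prediction is zero; spread is undefined")
    return pstdev(predictions) / centre


def summarize_track(predictions, ground_truths):
    """Return both assumed axes for a single object-pair track."""
    return {
        "error": mean_relative_error(predictions, ground_truths),
        "spread": coefficient_of_variation(predictions),
    }


# Hypothetical track: four views, the model answers 2.0 m every time,
# while the true distance is 4.0 m in every view.
summary = summarize_track([2.0, 2.0, 2.0, 2.0], [4.0, 4.0, 4.0, 4.0])
print(summary)
```

In this made-up track the spread is zero while the relative error is 50%, which is the "consistent but wrong" pattern the study describes: perfect stability says nothing about whether the answer is right.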

Across multiple leading VLMs, the team observes the same pattern: high prediction stability paired with substantial error. Models cluster in a regime of strong consistency but low accuracy, meaning they give the same (wrong) answer no matter which angle they 'see' the scene from. This suggests that stable spatial predictions can result from prior-driven collapse rather than from evidence-sensitive reasoning, which directly challenges the widespread use of cross-view consistency as a proxy for geometric understanding. ViewDiag offers a controlled benchmark and diagnostic framework for evaluating spatial VLMs beyond accuracy alone, which matters for robotics, autonomy, and embodied AI, where precise metric reasoning is essential. The code and data are available on GitHub.
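The distinction between the two collapse modes can be sketched as a simple per-track decision rule. This is not the authors' latent feature probe, whose details are not given here. It is a hedged illustration that assumes per-view feature vectors are available and uses mean pairwise cosine distance as a crude measure of whether features still tell views apart. The function names and all three thresholds are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-norm feature vector")
    return 1.0 - dot / (norm_u * norm_v)


def diagnose_track(predictions, ground_truths, features,
                   spread_tol=0.05, error_tol=0.25, feature_tol=0.01):
    """Label one track; all thresholds are illustrative assumptions."""
    if len(predictions) < 2:
        raise ValueError("a track needs at least two views")
    spread = pstdev(predictions) / mean(predictions)
    error = mean(abs(p - g) / g for p, g in zip(predictions, ground_truths))
    if spread > spread_tol:
        return "view-sensitive"           # predictions still move with the view
    if error <= error_tol:
        return "consistent and accurate"  # stable and close to ground truth
    # Stable but wrong: check whether internal features still separate views.
    feature_spread = mean(cosine_distance(u, v)
                          for u, v in combinations(features, 2))
    if feature_spread <= feature_tol:
        return "representation collapse"  # features no longer distinguish views
    return "decision collapse"            # features differ, the answer does not


# Hypothetical track: same wrong answer in all views, but distinct features.
label = diagnose_track(
    predictions=[2.0, 2.0, 2.0],
    ground_truths=[4.0, 4.0, 4.0],
    features=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
print(label)
```

The ordering of the checks reflects the argument in the text: consistency is tested first, accuracy second, and the feature comparison is only consulted for tracks that are stable yet wrong.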

Key Points
  • ViewDiag protocol uses 176 object-pair tracks across 80 scenes from Hypersim, ScanNet, and KITTI360 with 2–10 views each.
  • Leading VLMs show high cross-view consistency but low metric accuracy, suggesting prior-driven collapse rather than true geometric understanding.
  • The framework introduces a latent feature probe to distinguish between decision collapse and representation collapse in spatial reasoning.

Why It Matters

Challenges a core assumption in VLM evaluation, with direct impact on reliable spatial reasoning for robotics and autonomous systems.