Research & Papers

VLMs fail at spatial reasoning: ~30% accuracy under occlusion, <10% under perspective ambiguity

New paper reveals VLMs don't know when to stay quiet on spatial questions

Deep Dive

A new study from UNC Chapel Hill and Google researchers challenges how vision-language models (VLMs) are evaluated on spatial reasoning. The team built SpatialUncertain, a controlled framework that introduces two common real-world challenges: occlusion (objects hidden from view) and perspective ambiguity (misleading visual cues from certain angles). Whereas current benchmarks assume perfect observations and reward correct answers, SpatialUncertain forces models to decide whether a question can be answered at all. Across multiple frontier open- and closed-source VLMs, the results were sobering: average accuracy fell to ~30% under occlusion and below 10% under perspective ambiguity, with models confidently producing incorrect guesses rather than abstaining.

Even more troubling, when given multiple alternative viewpoints, models struggled to identify which angle would actually resolve the ambiguity, performing near random chance. The authors argue that the field must move beyond measuring answer correctness alone and start evaluating whether VLMs know when to abstain and how to seek reliable evidence. The findings have major implications for deploying VLMs in autonomous driving, robotics, and AR/VR, where spatial uncertainty is the norm, not the exception.
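
To make "evaluating whether VLMs know when to abstain" concrete, here is a minimal Python sketch of abstention-aware scoring. It is an illustration only, not the paper's metric or code: the `Item` record, the `ABSTAIN` label, and the `overconfidence_rate` name are all hypothetical. The idea is that a question with no recoverable answer counts as correct only if the model abstains, and any concrete guess on such a question counts as overconfident.

```python
from dataclasses import dataclass
from typing import Optional

ABSTAIN = "abstain"  # hypothetical label for "cannot be determined from this view"


@dataclass
class Item:
    answerable: bool       # can the question be answered from the given view?
    gold: Optional[str]    # reference answer; None when not answerable
    prediction: str        # model output, already normalized to a label


def score(items: list[Item]) -> dict[str, float]:
    """Score predictions so that abstaining is the correct response
    on unanswerable items (an assumed scheme, not the paper's)."""
    if not items:
        raise ValueError("need at least one item")
    correct = 0
    overconfident = 0
    unanswerable = 0
    for it in items:
        if it.answerable:
            # Abstaining on an answerable question is simply wrong.
            correct += int(it.prediction == it.gold)
        else:
            unanswerable += 1
            if it.prediction == ABSTAIN:
                correct += 1
            else:
                overconfident += 1  # guessed despite missing evidence
    return {
        "accuracy": correct / len(items),
        "overconfidence_rate": (
            overconfident / unanswerable if unanswerable else 0.0
        ),
    }


# Toy data, invented for illustration.
items = [
    Item(answerable=True, gold="left", prediction="left"),
    Item(answerable=False, gold=None, prediction="left"),
    Item(answerable=False, gold=None, prediction=ABSTAIN),
    Item(answerable=True, gold="behind", prediction=ABSTAIN),
]
result = score(items)
```

Reporting the overconfidence rate separately from accuracy keeps two failure modes apart: answering a solvable question wrongly, and answering an unsolvable one at all.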

Key Points
  • VLMs scored ~30% accuracy under occlusion and <10% under perspective ambiguity on SpatialUncertain
  • Models frequently gave overconfident answers instead of abstaining when visual evidence was incomplete
  • When additional views were available, models performed near random chance at selecting the viewpoint that would resolve the ambiguity

Why It Matters

Real-world spatial AI must know when to say 'I don't know', a skill the VLMs tested here largely lack.