VLMs fail at spatial reasoning under uncertainty: ~30% accuracy under occlusion, <10% under perspective ambiguity

arXiv cs.CV June 01, 2026

⚡ New paper reveals VLMs don't know when to stay quiet on spatial questions

Deep Dive

A new study from UNC Chapel Hill and Google researchers challenges how we evaluate vision-language models (VLMs) on spatial reasoning. The team built SpatialUncertain, a controlled framework that introduces two common real-world challenges: occlusion (objects hidden from view) and perspective ambiguity (misleading visual cues from certain angles). While current benchmarks assume perfect observations and reward correct answers, SpatialUncertain forces models to decide whether a question can even be answered at all. Across multiple frontier open- and closed-source VLMs, results were sobering: average accuracy fell to ~30% under occlusion and below 10% under perspective ambiguity, with models confidently producing incorrect guesses rather than abstaining.
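To see why rewarding only correct answers can hide this failure, here is a minimal Python sketch contrasting plain accuracy with an abstention-aware score. It is an illustration only, not the paper's metric: the `Item` class, the `ABSTAIN` label, the scoring rule, and the toy data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical label a model emits when it declines to answer.
ABSTAIN = "cannot_determine"


@dataclass
class Item:
    gold: Optional[str]  # None means the view does not contain enough evidence
    prediction: str      # the model's answer, or ABSTAIN


def plain_accuracy(items: List[Item]) -> float:
    """Accuracy over answerable items only; abstention behaviour is invisible."""
    answerable = [it for it in items if it.gold is not None]
    if not answerable:
        return 0.0
    return sum(it.prediction == it.gold for it in answerable) / len(answerable)


def abstention_aware_accuracy(items: List[Item]) -> float:
    """Correct means: right answer when answerable, ABSTAIN when unanswerable."""
    if not items:
        return 0.0
    correct = 0
    for it in items:
        target = ABSTAIN if it.gold is None else it.gold
        correct += it.prediction == target
    return correct / len(items)


# Toy data (made up): two answerable items, two unanswerable ones.
toy = [
    Item(gold="left", prediction="left"),
    Item(gold="behind", prediction="behind"),
    Item(gold=None, prediction="left"),    # confident guess on an occluded scene
    Item(gold=None, prediction=ABSTAIN),   # appropriate abstention
]

print(plain_accuracy(toy))             # 1.0
print(abstention_aware_accuracy(toy))  # 0.75
```

On the toy data the model looks perfect under plain accuracy, while the abstention-aware score drops because one unanswerable question received a confident guess.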

More troubling still, when given several alternative viewpoints, models struggled to identify which one would actually resolve the ambiguity, performing near random chance. The authors argue that the field must move beyond measuring answer correctness alone and start evaluating whether VLMs know when to abstain and how to seek reliable evidence. The findings matter for deploying VLMs in autonomous driving, robotics, and AR/VR, where spatial uncertainty is the norm rather than the exception.
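For a sense of what "near random chance" means in a view-selection task, the sketch below computes the expected accuracy of uniform random guessing and compares it with a model's observed selection accuracy. The function names, the number of candidate views per item, and the toy choices are assumptions for illustration; the summary does not state how many views the benchmark offers.

```python
from typing import List, Set, Tuple


def chance_accuracy(views_per_item: List[Tuple[int, int]]) -> float:
    """Expected accuracy of a uniform random pick.

    Each entry is (number of candidate views, number of views that resolve
    the ambiguity) for one item.
    """
    return sum(resolving / total for total, resolving in views_per_item) / len(views_per_item)


def selection_accuracy(choices: List[int], resolving_sets: List[Set[int]]) -> float:
    """Fraction of items where the chosen view index is a resolving view."""
    return sum(c in s for c, s in zip(choices, resolving_sets)) / len(choices)


# Toy setup (made up): three items, each with four candidate views, one of which resolves.
layout = [(4, 1), (4, 1), (4, 1)]
resolving = [{0}, {1}, {3}]
model_choices = [0, 2, 2]  # hypothetical model picks

print(chance_accuracy(layout))                       # 0.25
print(selection_accuracy(model_choices, resolving))  # one of three picks is correct
```

A model whose selection accuracy sits close to the chance value is, in effect, not using the scene to decide where to look next.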

Key Points

VLMs scored ~30% accuracy under occlusion and <10% under perspective ambiguity on SpatialUncertain
Models frequently gave overconfident answers instead of abstaining when visual evidence was incomplete
When additional views were available, models performed near random chance at selecting the correct resolving viewpoint

Why It Matters

Real-world spatial AI must know when to say 'I don't know', a skill that current VLMs largely lack in this evaluation.
