SceneDiver: New method breaks perceptual bottleneck in vision-language AI

arXiv cs.CV June 04, 2026

⚡ Vision models hallucinate less with coarse-to-fine focus planning (ICML 2026).

Deep Dive

Researchers from multiple institutions have introduced SceneDiver, a method that tackles the perceptual bottleneck in vision-language decision making. The problem: vision-language models (VLMs) and vision-language-action models (VLAs) often hallucinate objects in complex scenes, confusing task-relevant items with distractors. SceneDiver works coarse to fine: it first constructs a holistic scene graph for an initial understanding of the scene, then iteratively decomposes the task into simpler sub-problems through cycles of recognition, understanding, and analysis. Generating this focus plan draws on the long-term planning abilities of VLMs, while a lightweight adapter distills the same focus ability into VLAs for fast reactive control.
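To make the coarse-to-fine idea concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the `SceneObject`, `SceneGraph`, and `focus_plan` names are hypothetical, the scene is hand-built, and the sub-problems are hand-written predicates, whereas the article describes a VLM generating the focus plan. The sketch only illustrates the general pattern of starting from the whole scene and narrowing the set of attended objects one sub-problem at a time.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    # Hypothetical node type for a toy scene graph.
    name: str
    category: str
    attributes: set = field(default_factory=set)


@dataclass
class SceneGraph:
    objects: dict    # name -> SceneObject
    relations: set   # (subject, predicate, object) triples

    def related(self, predicate, target):
        """Names of objects that stand in `predicate` to `target`."""
        return {s for (s, p, o) in self.relations if p == predicate and o == target}


def focus_plan(graph, sub_problems):
    """Narrow the focus set one sub-problem at a time (coarse to fine)."""
    focus = set(graph.objects)      # coarse: begin with every object in the scene
    trace = [sorted(focus)]         # record the focus set after each step
    for step in sub_problems:
        focus = {name for name in focus if step(graph, graph.objects[name])}
        trace.append(sorted(focus))
    return focus, trace


# Toy cluttered scene: three mugs, one bowl, and two supporting surfaces.
graph = SceneGraph(
    objects={
        "red_mug_1": SceneObject("red_mug_1", "mug", {"red"}),
        "red_mug_2": SceneObject("red_mug_2", "mug", {"red"}),
        "blue_mug": SceneObject("blue_mug", "mug", {"blue"}),
        "red_bowl": SceneObject("red_bowl", "bowl", {"red"}),
        "table": SceneObject("table", "furniture"),
        "shelf": SceneObject("shelf", "furniture"),
    },
    relations={
        ("red_mug_1", "on", "table"),
        ("red_mug_2", "on", "shelf"),
        ("blue_mug", "on", "table"),
        ("red_bowl", "on", "table"),
    },
)

# Hand-written sub-problems for the task "pick up the red mug on the table".
sub_problems = [
    lambda g, o: o.category == "mug",                 # which objects are mugs?
    lambda g, o: "red" in o.attributes,               # which of those are red?
    lambda g, o: o.name in g.related("on", "table"),  # which of those are on the table?
]

focus, trace = focus_plan(graph, sub_problems)
for i, step_focus in enumerate(trace):
    print(f"after step {i}: {step_focus}")
```

Each step discards distractors that a single pass over the scene could easily confuse with the target (the second red mug, the red bowl, the blue mug on the same table), leaving one object in focus at the end.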

Evaluated on standard embodied AI benchmarks (robotic manipulation, navigation), SceneDiver substantially reduces visual hallucinations for both model types without sacrificing computational efficiency. The paper has been accepted at ICML 2026, and the code and data are publicly released. The work is especially relevant to real-world robotics, where accurate perception in cluttered scenes is critical. By enabling models to focus systematically on relevant objects, SceneDiver moves beyond one-step attention methods that fail for lack of deep scene understanding.

Key Points

SceneDiver reduces visual hallucinations in VLMs and VLAs via a coarse-to-fine focus planning approach.
Method first builds a holistic scene graph, then iteratively decomposes tasks into sub-problems for better object focus.
Accepted at ICML 2026 with code and data publicly released on GitHub.

Why It Matters

Makes embodied AI (robotics, navigation) more reliable by substantially reducing visual hallucinations.
