Research & Papers

SceneDiver: New method breaks perceptual bottleneck in vision-language AI

Vision models hallucinate less with coarse-to-fine focus planning — ICML 2026.

Deep Dive

Researchers from multiple institutions have introduced SceneDiver, a method that targets the perceptual bottleneck in vision-language decision making. The problem: vision-language models (VLMs) and vision-language-action models (VLAs) often hallucinate objects in complex scenes and confuse task-relevant items with distractors. SceneDiver works coarse to fine: it first constructs a holistic scene graph for an initial understanding of the scene, then iteratively decomposes the task into simpler sub-problems through cycles of recognition, understanding, and analysis. Generating this focus plan draws on the long-term planning abilities of VLMs, while a lightweight adapter distills the same focus ability into VLAs for fast reactive control.
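To make the coarse-to-fine idea concrete, here is a toy Python sketch. It is not the authors' implementation: the names SceneObject, build_scene_graph, and focus_plan are hypothetical, the "scene graph" is only a grouping of hand-labeled objects by region, and simple filters stand in for the VLM-driven recognition, understanding, and analysis cycles. It shows only the general pattern of narrowing a cluttered scene to a task-relevant object one sub-problem at a time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical, hand-labeled object; a real system would get this from perception."""
    name: str
    region: str
    attributes: frozenset = frozenset()


def build_scene_graph(objects):
    """Coarse step (toy stand-in): group objects by the region they sit in."""
    graph = {}
    for obj in objects:
        graph.setdefault(obj.region, []).append(obj)
    return graph


def focus_plan(graph, sub_goals):
    """Fine step (toy stand-in): apply one sub-goal per cycle to narrow the candidates.

    Each sub-goal is a (kind, value) pair with kind in {"region", "name", "attribute"}.
    Returns the remaining candidates and a trace of what survived each cycle.
    """
    candidates = [obj for objs in graph.values() for obj in objs]
    trace = []
    for kind, value in sub_goals:
        if kind == "region":
            candidates = [o for o in candidates if o.region == value]
        elif kind == "name":
            candidates = [o for o in candidates if o.name == value]
        elif kind == "attribute":
            candidates = [o for o in candidates if value in o.attributes]
        else:
            raise ValueError(f"unknown sub-goal kind: {kind}")
        trace.append(((kind, value), [o.name for o in candidates]))
        if len(candidates) <= 1:
            break  # focus is resolved (or nothing matches); stop decomposing
    return candidates, trace


# Task: "pick up the red mug on the table" in a scene with distractors.
scene = [
    SceneObject("mug", "table", frozenset({"red"})),
    SceneObject("mug", "table", frozenset({"blue"})),
    SceneObject("mug", "shelf", frozenset({"red"})),
    SceneObject("bowl", "table", frozenset({"red"})),
]
graph = build_scene_graph(scene)
targets, trace = focus_plan(
    graph, [("region", "table"), ("name", "mug"), ("attribute", "red")]
)
for goal, remaining in trace:
    print(goal, "->", remaining)
```

In this toy scene, each cycle removes one class of distractor (wrong region, wrong object type, wrong color), so the candidate set shrinks from four objects to one. In the method as described, the sub-problems would come from the VLM's own planning rather than from a hand-written list.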

Evaluated on standard embodied AI benchmarks covering robotic manipulation and navigation, SceneDiver substantially reduces visual hallucinations for both model types without sacrificing computational efficiency. The method has been accepted at ICML 2026, and its code and data are publicly released. The work is especially relevant to real-world robotics, where accurate perception under clutter is critical. By enabling models to focus systematically on relevant objects, SceneDiver moves beyond one-step attention methods, which fall short because they lack deep scene understanding.

Key Points
  • SceneDiver reduces visual hallucinations in VLMs and VLAs via a coarse-to-fine focus planning approach.
  • The method first builds a holistic scene graph, then iteratively decomposes the task into sub-problems to sharpen object focus.
  • Accepted at ICML 2026 with code and data publicly released on GitHub.

Why It Matters

Makes embodied AI (robotics, navigation) more reliable by substantially reducing visual hallucinations in cluttered scenes.