Research & Papers

DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models

New benchmark reveals why VLMs see things that aren't there—and it's not just bad vision.

Deep Dive

A team of researchers led by JiYang Wang has released DO-Bench, a controlled diagnostic benchmark designed to pinpoint why vision-language models (VLMs) hallucinate objects, particularly in binary object-existence verification tasks ("is object X present in this image?"). Rather than simply reporting aggregate accuracy, DO-Bench uses structured multimodal interventions to separate errors caused by perceptual limitations from those driven by contextual textual priors (e.g., language biases). The benchmark introduces two complementary dimensions. The Prior Override dimension progressively strengthens the textual context while holding visual evidence constant, testing a model's resistance to prior pressure. The Perception-Limited dimension incrementally enhances visual evidence, from full-scene context to localized object crops, measuring the strength of perceptual grounding. This paired design lets researchers attribute errors to prior-suppression failure, perceptual insufficiency, or their interaction. Two new diagnostic metrics, PriorRobust and PerceptionAbility, quantify these behaviors consistently across models.
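The paper's exact metric definitions aren't reproduced here, but the intervention logic can be sketched roughly: hold one modality fixed while varying the other, then measure how accuracy shifts. A minimal illustration, with both function names reused for clarity but the formulas entirely hypothetical (not taken from the paper):

```python
from statistics import mean

def prior_robust(acc_by_prior_level):
    """Hypothetical PriorRobust sketch: fraction of baseline accuracy
    retained as textual prior pressure increases, with visual evidence
    held fixed. Input: accuracies at prior levels 0 (none) .. N (strongest)."""
    baseline = acc_by_prior_level[0]
    if baseline == 0:
        return 0.0
    return mean(acc_by_prior_level[1:]) / baseline

def perception_ability(acc_by_visual_level):
    """Hypothetical PerceptionAbility sketch: accuracy gained by enhancing
    visual evidence (full scene -> localized object crop), with the
    textual context held fixed."""
    return acc_by_visual_level[-1] - acc_by_visual_level[0]

# Example: a model that resists textual priors (accuracy barely drops
# under stronger priors) but has weak grounding (large gain from crops).
pr = prior_robust([0.90, 0.88, 0.85])     # stays near 1.0
pa = perception_ability([0.55, 0.80])      # sizable gain from better evidence
```

The point of the paired design is visible even in this toy form: a high `pr` with a high `pa` indicates a model whose full-scene perception, not its prior resistance, is the bottleneck.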

Evaluations across a diverse set of open- and closed-source VLMs reveal systematic differences in how models handle prior bias and visual reliability. For instance, some models show high perceptual ability but are easily swayed by strong textual priors, while others resist textual bias but fail when visual evidence is limited. The findings demonstrate that object hallucination is not a monolithic issue; it reflects heterogeneous, mechanism-dependent failure patterns that aggregate accuracy scores alone cannot capture. DO-Bench provides a much-needed framework for diagnosing and improving VLM reliability in safety-critical applications like autonomous driving, medical imaging, and content moderation, where hallucinations can have serious consequences.

Key Points
  • DO-Bench isolates hallucination causes via two dimensions: Prior Override (textual bias resistance) and Perception-Limited (visual grounding strength).
  • Introduces two diagnostic metrics: PriorRobust and PerceptionAbility, enabling consistent quantification of failure mechanisms.
  • Evaluations across multiple VLMs reveal systematic, mechanism-dependent failure patterns, not just aggregate accuracy differences.

Why It Matters

Provides a diagnostic framework to make VLMs more reliable in safety-critical applications like autonomous driving and medical imaging.