Research & Papers

Benchmarking Deflection and Hallucination in Large Vision-Language Models

A new benchmark of 2,775 samples shows that 20 leading vision-language models struggle to admit uncertainty.

Deep Dive

A team of researchers from the University of Cambridge and Google has published a new benchmark, VLM-DeflectionBench, that exposes a critical weakness in today's most advanced vision-language models. The study, accepted to ACL 2026, systematically tests 20 state-of-the-art LVLMs, including GPT-4V and Claude 3.5, on their ability to handle uncertainty and conflicting information. The core finding is stark: when presented with noisy, misleading, or incomplete visual and textual evidence, these models overwhelmingly fail to deflect appropriately, that is, to respond with 'I cannot answer' or a similar refusal. Instead, they often confidently generate incorrect or hallucinated responses.
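The paper's scoring code is not reproduced here, so the following is only a minimal sketch of how deflection might be detected and scored per sample. The phrase list, function names, and the evidence_is_sufficient flag are illustrative assumptions, not the authors' protocol.

```python
import re

# Phrases a model might use to decline; the benchmark's actual
# deflection criteria are not specified here, so this list is an
# assumption for demonstration only.
DEFLECTION_PATTERNS = [
    r"i cannot answer",
    r"i can't answer",
    r"not enough information",
    r"insufficient (visual |textual )?evidence",
    r"unable to determine",
]

def is_deflection(response: str) -> bool:
    """Return True if the response declines to answer."""
    text = response.lower()
    return any(re.search(p, text) for p in DEFLECTION_PATTERNS)

def score_sample(response: str, evidence_is_sufficient: bool) -> str:
    """Classify one model response against one benchmark sample.

    A model should deflect when the evidence is noisy, misleading,
    or incomplete, and answer only when the evidence supports it.
    """
    deflected = is_deflection(response)
    if evidence_is_sufficient:
        return "over-deflected" if deflected else "answered"
    return "correctly_deflected" if deflected else "hallucination_risk"
```

Under this framing, a model that answers confidently when evidence_is_sufficient is False is exactly the failure mode the study reports.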

The benchmark itself is a significant technical contribution: 2,775 carefully curated samples spanning diverse multimodal retrieval scenarios. To counter the rapid obsolescence of existing benchmarks, where models can answer questions from their massive training data without needing retrieval at all, the researchers developed a dynamic data curation pipeline that filters for samples that are genuinely retrieval-dependent, keeping the benchmark challenging over time. The evaluation protocol disentangles four key scenarios, separating what a model has memorized in its parameters from how robustly it reasons over an external knowledge base (KB-VQA). All resources from the study will be made publicly available, giving the community a reusable, extensible tool for building more reliable and honest multimodal systems.
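The pipeline is described only at a high level, but its core filter can be sketched as a two-pass check: discard samples the model already answers from parametric memory, and keep those it can solve only once retrieved evidence is supplied. The ask and retrieve callables and the substring-matching correctness check below are hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, Iterable

def is_retrieval_dependent(
    question: str,
    answer: str,
    ask: Callable[[str], str],       # hypothetical: query the LVLM
    retrieve: Callable[[str], str],  # hypothetical: fetch KB evidence
) -> bool:
    """Keep a sample only if the model needs external evidence.

    If the model answers correctly closed-book, the sample tests
    memorization rather than retrieval, and it would go stale as
    training corpora grow.
    """
    closed_book = ask(question)
    if answer.lower() in closed_book.lower():
        return False  # answerable from parametric memory: discard
    open_book = ask(f"Evidence: {retrieve(question)}\nQuestion: {question}")
    return answer.lower() in open_book.lower()  # solvable with evidence: keep

def curate(samples: Iterable[tuple[str, str]], ask, retrieve):
    """Filter a candidate pool down to retrieval-dependent samples."""
    return [(q, a) for q, a in samples
            if is_retrieval_dependent(q, a, ask, retrieve)]
```

Because the filter is re-runnable against newer models, the sample pool can be re-curated as training data grows, which is what keeps the benchmark from becoming obsolete.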

Key Points
  • VLM-DeflectionBench contains 2,775 samples designed to test model behavior with conflicting or insufficient evidence.
  • Evaluates 20 leading LVLMs, finding they overwhelmingly fail to deflect and instead generate confident but incorrect answers.
  • Uses a dynamic curation pipeline to filter for retrieval-dependent questions, preventing benchmark obsolescence as training data grows.

Why It Matters

For anyone deploying AI assistants, knowing when a model is uncertain is as crucial as knowing when it is correct.