Research & Papers

New study: Vision-language models pass benchmarks without really looking at images

Removing up to 50% of image tokens barely hurts performance. Are the benchmarks broken?

Deep Dive

A new study found that removing substantial fractions of image tokens from vision-language models (VLMs) only slightly degrades scores on a widely used hallucination benchmark. The authors' systematic analysis across multiple VLMs shows that models rely less on fine-grained visual evidence than their accuracy scores suggest. The work, accepted to the CVPR 2026 workshop GRAIL-V, argues that current benchmarks fail to reliably evaluate fine-grained visual grounding.
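To make "removing image tokens" concrete, here is a minimal sketch of random token dropping. It is an illustration only, not the study's procedure: the function name `drop_image_tokens`, the uniform random selection, and the toy 576-token input are all assumptions, and the study may select tokens differently.

```python
import random


def drop_image_tokens(tokens, drop_fraction, seed=0):
    """Randomly remove a fraction of tokens, preserving the order of the rest.

    Hypothetical helper for illustration; not taken from the study.
    """
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be between 0 and 1")
    n_drop = int(round(len(tokens) * drop_fraction))
    n_keep = len(tokens) - n_drop
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep_idx]


# Toy stand-in for image patch tokens: 576 small vectors (assumed sizes).
image_tokens = [[float(i), 0.0, 1.0, -1.0] for i in range(576)]
kept = drop_image_tokens(image_tokens, drop_fraction=0.5)
print(len(image_tokens), len(kept))
```

In a real evaluation, the surviving tokens would be passed to the language model in place of the full set, and benchmark accuracy compared across drop fractions.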

Key Points
  • Removing up to 50% of image tokens barely reduces accuracy on the hallucination benchmark tested
  • VLMs show increased similarity among visual tokens in deeper layers, reducing sensitivity to fine-grained details
  • The study tested multiple VLMs under three conditions: global degradation, localized occlusion, and answer-space expansion
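The token-similarity point above can be illustrated with a simple measurement: the mean pairwise cosine similarity among a layer's visual tokens. This is a sketch under assumptions, not the study's metric. The function names and the toy `shallow` and `deep` vectors are hypothetical and chosen only to show what "more similar" looks like numerically.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all distinct pairs of token vectors."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    total, count = 0.0, 0
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            total += cosine(tokens[i], tokens[j])
            count += 1
    return total / count


# Hypothetical toy data: spread-out tokens vs. nearly identical tokens.
shallow = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
deep = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]

shallow_sim = mean_pairwise_cosine(shallow)
deep_sim = mean_pairwise_cosine(deep)
print(shallow_sim < deep_sim)
```

A value close to 1 means the tokens point in nearly the same direction, so dropping some of them removes little distinct information.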

Why It Matters

Benchmarks that don't actually test vision could give a misleading picture of progress: models may look more capable than they really are.