New study: Vision-language models pass benchmarks without really looking at images
Removing up to 50% of image tokens barely hurts performance. Are benchmarks broken?
Deep Dive
A new study found that removing substantial fractions of image tokens from vision-language models (VLMs) only slightly degrades scores on a widely used hallucination benchmark. The authors' systematic analysis across multiple VLMs shows that models rely less on fine-grained visual evidence than their accuracy scores suggest. The work, accepted to the CVPR 2026 workshop GRAIL-V, argues that current benchmarks fail to reliably evaluate fine-grained visual grounding.
Key Points
- Removing up to 50% of image tokens barely reduces accuracy on the hallucination benchmark tested
- VLMs show increased similarity among visual tokens in deeper layers, reducing sensitivity to fine-grained details
- The study tested multiple VLMs under three conditions: global degradation, localized occlusion, and answer-space expansion
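
To make the first two points concrete, here is a minimal sketch of the two operations involved: randomly dropping a fraction of image tokens, and measuring how similar the remaining tokens are to one another. It is not the study's code. The function names (`drop_tokens`, `mean_pairwise_cosine`) and the toy vectors are assumptions made for illustration, and a real VLM would operate on high-dimensional embeddings taken from a specific layer.

```python
import math
import random


def drop_tokens(tokens, drop_fraction, seed=0):
    """Randomly remove a fraction of tokens, keeping the rest in original order.

    Hypothetical helper: the study's actual removal procedure may differ.
    """
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be between 0 and 1")
    rng = random.Random(seed)
    keep_count = len(tokens) - int(round(len(tokens) * drop_fraction))
    kept_indices = sorted(rng.sample(range(len(tokens)), keep_count))
    return [tokens[i] for i in kept_indices]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all distinct pairs of token vectors."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    total, pairs = 0.0, 0
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            total += cosine(tokens[i], tokens[j])
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    # Toy 2-D "image tokens"; values are made up for illustration only.
    toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9],
                  [1.0, 1.0], [0.5, 0.4], [0.2, 0.8], [0.7, 0.3]]
    kept = drop_tokens(toy_tokens, drop_fraction=0.5)
    print(len(kept), "of", len(toy_tokens), "tokens kept")
    print("mean pairwise cosine:", mean_pairwise_cosine(kept))
```

If token vectors in a deeper layer are nearly interchangeable, the mean pairwise similarity approaches 1, and dropping half of them removes little distinct information. That is one intuition for why accuracy could stay flat under token removal.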
Why It Matters
Benchmarks that don't test whether a model actually uses the image can give a misleading picture of progress: models may look more visually grounded than they really are.