New study: Vision-language models pass benchmarks without really looking at images

arXiv cs.CV May 25, 2026

⚡ Removing 50% of image tokens barely hurts performance. Are the benchmarks broken?

Deep Dive

A new study finds that removing substantial fractions of image tokens (up to 50%) from vision-language models (VLMs) only slightly degrades scores on a widely used hallucination benchmark. The authors' systematic analysis across multiple VLMs indicates that the models rely less on fine-grained visual evidence than their accuracy scores suggest. The work, accepted to the CVPR 2026 workshop GRAIL-V, argues that current benchmarks fail to reliably evaluate fine-grained visual grounding.
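
To make the token-removal idea concrete, here is a minimal sketch of such an ablation. It is not the authors' code. Uniform random dropping, the `predict(image_tokens, question)` callable standing in for a VLM, and the `(image_tokens, question, answer)` example layout are all assumptions made for illustration.

```python
import random


def drop_image_tokens(tokens, drop_fraction, seed=0):
    """Return a copy of `tokens` with about `drop_fraction` of them removed.

    Surviving tokens keep their original order. Uniform random selection is
    an assumption; the study may select tokens differently.
    """
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be in [0, 1]")
    n_keep = len(tokens) - int(len(tokens) * drop_fraction)
    rng = random.Random(seed)
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep_idx]


def accuracy(predict, examples, drop_fraction):
    """Fraction of examples answered correctly after dropping image tokens.

    `predict` is a hypothetical stand-in for a VLM's answer function.
    """
    correct = 0
    for image_tokens, question, answer in examples:
        kept = drop_image_tokens(image_tokens, drop_fraction)
        correct += predict(kept, question) == answer
    return correct / len(examples)
```

If a model's answers depend mostly on the question and not on the image, `accuracy` stays nearly flat as `drop_fraction` rises, which is the pattern the study reports for the benchmark it tested.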

Key Points

Removing up to 50% of image tokens barely reduces accuracy on the hallucination benchmark tested
VLMs show increased similarity among visual tokens in deeper layers, reducing sensitivity to fine-grained details
Study tested multiple VLMs across global degradation, localized occlusion, and answer-space expansion
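
The second point, about visual tokens becoming more alike in deeper layers, can be illustrated with a simple similarity measure. The sketch below uses mean pairwise cosine similarity over a layer's visual token vectors. That metric is an assumption chosen for illustration, and the summary does not say which measure the authors used.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all unordered pairs of token vectors."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    pairs = list(combinations(tokens, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy vectors (made up, not real activations): spread out vs. nearly parallel.
shallow_layer = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
deep_layer = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.0]]
```

A value near 1 means the tokens carry nearly the same direction, so dropping some of them removes little distinct information. Computing this per layer on real hidden states would show whether similarity rises with depth.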

Why It Matters

Benchmarks that don't actually test vision can give a misleading picture of progress: models may appear more visually grounded than they really are.
