VERA: Identifying and Leveraging Visual Evidence Retrieval Heads in Long-Context Understanding
Researchers discover 'visual evidence retrieval' heads in vision-language models, then leverage them for roughly 20% relative gains with no extra training.
A new paper reveals VERA, a training-free framework that identifies and leverages specialized 'Visual Evidence Retrieval' attention heads within Vision-Language Models. These heads are critical for locating visual cues during complex reasoning. By detecting model uncertainty and triggering explicit verbalization of this evidence, VERA dramatically improves long-context understanding. It delivers average relative improvements of 21.3% on Qwen3-VL-8B and 20.1% on GLM-4.1V across five benchmarks.
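The core idea of scoring attention heads by how strongly they retrieve evidence can be illustrated with a minimal sketch. The function below is an assumption about the general approach, not the paper's exact procedure: given attention weights and a mask marking known visual-evidence tokens, it ranks each (layer, head) pair by the attention mass placed on those tokens while answering.

```python
import numpy as np

def retrieval_head_scores(attn, evidence_mask):
    """Score each head by attention mass on evidence tokens.

    attn:          (layers, heads, queries, keys) attention weights
    evidence_mask: (keys,) bool array, True at visual-evidence tokens
    Illustrative sketch only; VERA's actual detection may differ.
    """
    # Sum attention mass landing on evidence keys, then average
    # over the answer-generation query positions.
    mass_on_evidence = attn[..., evidence_mask].sum(-1)  # (L, H, Q)
    return mass_on_evidence.mean(-1)                     # (L, H)

# Toy example: 2 layers, 2 heads, 1 query, 4 key tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 2, 1, 4))
attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
evidence = np.array([False, False, True, True])

scores = retrieval_head_scores(attn, evidence)
top = np.unravel_index(scores.argmax(), scores.shape)
print("strongest evidence-retrieval head (layer, head):", top)
```

In a training-free setting like VERA's, a ranking of this kind could be computed once on probe examples and then used at inference time to decide when the top heads' signal is weak, i.e. when the model is uncertain and should verbalize its evidence.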
Why It Matters
This unlocks a simple, training-free performance boost for existing vision-language models, making them significantly better at complex long-context visual reasoning.