EAGLE framework aligns visual evidence for better multi-agent VQA
Vision-language agents stop arguing and start looking at the same image regions.
A new research paper introduces EAGLE (Evidence-Aligned Grounded muLti-agent rEasoning), a training-free framework designed to improve multi-agent visual question answering (VQA). The authors (Yuhan Wang, Shuochen Chang, and a team of 11 researchers) identify a critical flaw in existing multi-agent VQA methods: they rely on textual discussion and answer-level agreement but never check whether the agents actually base their answers on the same visual evidence. The result can be a misleading consensus, in which agents agree on a wrong answer because each one hallucinated or misread a different part of the image.
EAGLE addresses this by explicitly exposing each agent's grounding regions (the image areas it used to answer) as visual evidence. Agents then mutually verify this evidence, and the framework uses evidence consistency to guide the final decision. Because the approach is training-free, the authors say it works with any off-the-shelf vision-language model (VLM), and they report the best average performance across six diverse VQA benchmarks while keeping the method lightweight and interpretable. That combination could make EAGLE practical for real-world applications where reliable visual reasoning is critical, such as medical imaging, autonomous systems, and content moderation.
- EAGLE exposes each VLM agent's grounding regions as visual evidence, enabling mutual verification rather than just textual debate.
- The training-free framework works with any off-the-shelf vision-language model and achieves the best average performance across six VQA benchmarks.
- Key insight: answer-level agreement is insufficient; aligned visual evidence is essential for trustworthy multi-agent consensus.
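The difference between answer-level agreement and evidence-aligned agreement can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: grounding regions are modeled as axis-aligned boxes, evidence consistency is approximated by mean pairwise intersection-over-union (IoU) among agents that give the same answer, and the names `AgentOutput`, `majority_answer`, and `evidence_aware_answer` are hypothetical. EAGLE's actual verification and decision procedure may differ.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple

# Assumption: a grounding region is an axis-aligned box (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class AgentOutput:
    """Hypothetical record of one agent's answer and the region it cites."""
    agent: str
    answer: str
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def majority_answer(outputs: List[AgentOutput]) -> str:
    """Answer-level agreement only: the most frequent answer wins."""
    return Counter(o.answer for o in outputs).most_common(1)[0][0]


def evidence_scores(outputs: List[AgentOutput]) -> Dict[str, float]:
    """Score each answer by supporter count times mean pairwise IoU.

    Assumption: an answer backed by a single agent scores 0.0 because
    no other agent corroborates its region.
    """
    groups: Dict[str, List[Box]] = defaultdict(list)
    for o in outputs:
        groups[o.answer].append(o.box)
    scores: Dict[str, float] = {}
    for answer, boxes in groups.items():
        pairs = list(combinations(boxes, 2))
        consistency = sum(iou(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
        scores[answer] = len(boxes) * consistency
    return scores


def evidence_aware_answer(outputs: List[AgentOutput]) -> str:
    """Pick the answer whose supporters point at consistent regions."""
    scores = evidence_scores(outputs)
    return max(scores, key=scores.get)


# Three agents say "red" but cite disjoint regions; two agents say "blue"
# and cite nearly the same region.
outputs = [
    AgentOutput("a1", "red", (0, 0, 20, 20)),
    AgentOutput("a2", "red", (60, 60, 90, 90)),
    AgentOutput("a3", "red", (100, 0, 120, 20)),
    AgentOutput("a4", "blue", (10, 10, 50, 50)),
    AgentOutput("a5", "blue", (10, 10, 50, 60)),
]

print(majority_answer(outputs))        # red
print(evidence_aware_answer(outputs))  # blue
```

In this toy case, plain voting accepts "red" even though its three supporters looked at unrelated parts of the image, while the evidence-weighted score favors "blue", whose supporters cite overlapping regions. The multiplication of supporter count by mean IoU is only one simple way to combine the two signals.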
Why It Matters
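By requiring agents to agree on visual evidence and not only on answers, EAGLE aims to reduce hallucination-driven errors in multi-agent vision systems, which could make AI more reliable for critical visual tasks such as diagnostics and autonomous sensing.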