Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning
A new study finds that medical AI models reach 63% accuracy on VQA-RAD yet retain 81% of that performance when shown blank images, suggesting they largely ignore the visual input.
A team of researchers has published a groundbreaking paper titled 'Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning,' revealing that current multimodal AI models for medical diagnosis are exploiting shortcuts and largely ignoring the visual data they're supposed to analyze. The study introduces a novel counterfactual evaluation framework that tests models with real, blank, and shuffled images across four major medical VQA benchmarks: PathVQA, PMC-VQA, SLAKE, and VQA-RAD. The researchers discovered that text-only reinforcement learning with verifiable rewards (RLVR) can match or even outperform image-text RLVR on standard accuracy metrics, suggesting current evaluation protocols completely fail to measure causal visual dependence.
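To make the counterfactual protocol concrete, the sketch below shows how a VQA benchmark can be scored under real, blank, and shuffled image conditions. This is a minimal illustration, not the authors' code: the `model.answer(image=..., question=...)` interface, the exact-match scoring, and the way blank and shuffled images are constructed are all assumptions made for this example.

```python
from dataclasses import dataclass
import random
from PIL import Image

@dataclass
class VQAExample:
    image: Image.Image
    question: str
    answer: str

def blank_like(img: Image.Image) -> Image.Image:
    """Return an all-black image with the same size and mode as the original."""
    return Image.new(img.mode, img.size, 0)

def counterfactual_accuracy(model, dataset: list[VQAExample],
                            condition: str, seed: int = 0) -> float:
    """Accuracy under one visual condition: 'real', 'blank', or 'shuffled'.

    'shuffled' re-pairs questions with images via a random permutation of the
    dataset, so a drop in accuracy indicates genuine reliance on the correct image.
    """
    rng = random.Random(seed)
    if condition == "shuffled":
        shuffled = [ex.image for ex in dataset]
        rng.shuffle(shuffled)

    correct = 0
    for i, ex in enumerate(dataset):
        if condition == "real":
            img = ex.image
        elif condition == "blank":
            img = blank_like(ex.image)
        elif condition == "shuffled":
            img = shuffled[i]
        else:
            raise ValueError(f"unknown condition: {condition}")
        # Hypothetical model interface; exact-match scoring for illustration only.
        pred = model.answer(image=img, question=ex.question)
        correct += int(pred.strip().lower() == ex.answer.strip().lower())
    return correct / len(dataset)
```

Comparing accuracy across the three conditions is what exposes the gap the authors report: a genuinely grounded model should lose substantial accuracy when the correct image is blanked out or swapped for another example's image.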
The technical findings are alarming. Text-only RLVR achieved a negative Visual Reliance Score (-0.09) on PathVQA, meaning models performed better with mismatched images than with the correct ones. On VQA-RAD, both the text-only and image-text RLVR variants reached 63% accuracy through different shortcut mechanisms: the text-only model retained 81% performance with blank images, while the image-text model showed only 29% image sensitivity. Most concerning, models generated visual claims in 68-74% of responses, yet 38-43% of those claims were completely ungrounded, as measured by the new Hallucinated Visual Reasoning Rate (HVRR) metric. These results demonstrate that accuracy-only rewards enable dangerous shortcut exploitation, and that progress in medical AI requires fundamentally new grounding-aware evaluation protocols and training objectives that explicitly enforce visual dependence.
- Text-only RLVR models achieved a negative Visual Reliance Score (-0.09) on PathVQA, performing better with mismatched images than with the correct ones
- Models maintained 81% performance on VQA-RAD with blank images while generating visual claims in 68-74% of responses
- 38-43% of visual claims were completely ungrounded hallucinations, measured by the new HVRR metric
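The paper's exact formulas for the Visual Reliance Score (VRS) and the HVRR are not reproduced here, but the reported numbers are consistent with formulations like the following sketch. The accuracy inputs would come from a counterfactual loop like the one above; `makes_visual_claim` and `is_grounded` stand in for whatever claim-detection and grounding judges the authors use, and are hypothetical placeholders.

```python
from typing import Callable, Iterable

def visual_reliance_score(acc_real: float, acc_counterfactual: float) -> float:
    """Plausible VRS: relative accuracy drop when the correct image is replaced
    by a blank or mismatched one. A negative value (e.g. -0.09 on PathVQA)
    means the model scored higher without the correct image."""
    return (acc_real - acc_counterfactual) / max(acc_real, 1e-8)

def hallucinated_visual_reasoning_rate(
    responses: Iterable[str],
    makes_visual_claim: Callable[[str], bool],  # hypothetical claim detector
    is_grounded: Callable[[str], bool],         # hypothetical grounding judge
) -> float:
    """Plausible HVRR: among responses that assert something about the image,
    the fraction whose visual claims are not actually supported by it."""
    claims = [r for r in responses if makes_visual_claim(r)]
    if not claims:
        return 0.0
    return sum(1 for r in claims if not is_grounded(r)) / len(claims)
```

Normalizing VRS by real-image accuracy makes the score comparable across benchmarks of different difficulty; an unnormalized accuracy difference would carry the same sign and tell the same story about whether a model actually depends on the image.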
Why It Matters
The study exposes critical flaws in medical AI evaluation: models can appear accurate while ignoring crucial visual evidence, raising the risk of misdiagnosis.