Reordering conflicting biomedical evidence flips up to 25% of LLM predictions, study finds
Up to 25.2% of predictions flip when only the order of retrieved documents changes in RAG.
A new study from researchers Yikun Han, Mengfei Lan, and Halil Kilicoglu, accepted at BioNLP 2026, exposes a critical vulnerability in retrieval-augmented large language models (LLMs) used for biomedical question answering. The team introduced HealthContradict, a dataset designed to test models when retrieved evidence is incomplete, misleading, or contradictory. They evaluated six open-weight LLMs across five controlled evidence conditions: no context, correct-only, incorrect-only, and two mixed conditions where both correct and incorrect documents were present but in opposite orders.
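The five evidence conditions can be pictured as five different document lists passed to the same question. The sketch below is illustrative only: the condition names, the `build_conditions` and `format_prompt` helpers, and the prompt layout are assumptions made for this example, not the authors' code or the dataset's actual schema.

```python
from typing import Dict, List


def build_conditions(correct_doc: str, incorrect_doc: str) -> Dict[str, List[str]]:
    """Return the document list for each of the five evidence conditions.

    Condition names are hypothetical labels chosen for this sketch.
    """
    return {
        "no_context": [],
        "correct_only": [correct_doc],
        "incorrect_only": [incorrect_doc],
        # The two conflicting conditions hold the same documents in opposite orders.
        "correct_first": [correct_doc, incorrect_doc],
        "incorrect_first": [incorrect_doc, correct_doc],
    }


def format_prompt(question: str, docs: List[str]) -> str:
    """Assemble a plain-text prompt; this layout is an assumption, not the paper's."""
    if not docs:
        return f"Question: {question}\nAnswer:"
    context = "\n".join(f"Document {i}: {d}" for i, d in enumerate(docs, start=1))
    return f"{context}\nQuestion: {question}\nAnswer:"


conditions = build_conditions("Evidence supporting the claim.", "Evidence refuting the claim.")
prompts = {name: format_prompt("Is the claim true?", docs) for name, docs in conditions.items()}
```

Because the two conflicting conditions differ only in document order, any change in a model's answer between them can be attributed to ordering alone.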
Results show that even when the same two conflicting documents are present, simply reversing their order caused 11.4%–25.2% of predictions to flip. To address this, the authors propose a conflict-aware abstention score that combines model confidence with an evidence conflict detector. In the two hardest conditions, this score improved selective accuracy (accuracy on the questions the model still answers after abstaining on the rest) by 7.2–33.4 points for incorrect-only evidence and by 3.6–14.4 points for incorrect-first conflicting evidence, measured at coverage rates of 75%, 50%, and 25%. The findings argue that evaluation metrics and deployment strategies must explicitly account for evidence disagreement to ensure reliable biomedical AI.
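To make the two metrics concrete, here is a minimal sketch of a flip rate and of selective accuracy at a fixed coverage. The article does not give the authors' formula for the conflict-aware score, so `abstention_score` below is a hypothetical weighted combination, and its `weight` parameter and all function names are assumptions made for illustration.

```python
from typing import Sequence


def flip_rate(preds_a: Sequence[str], preds_b: Sequence[str]) -> float:
    """Fraction of questions whose prediction differs between two document orders."""
    if len(preds_a) != len(preds_b) or len(preds_a) == 0:
        raise ValueError("prediction lists must be non-empty and the same length")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def abstention_score(confidence: float, conflict: float, weight: float = 0.5) -> float:
    """Hypothetical combination: higher means more reason to abstain.

    `confidence` is the model's confidence in its answer and `conflict` is a
    conflict detector's output, both assumed to lie in [0, 1].
    """
    return weight * (1.0 - confidence) + (1.0 - weight) * conflict


def selective_accuracy(correct: Sequence[bool], scores: Sequence[float], coverage: float) -> float:
    """Accuracy on the `coverage` fraction of questions with the lowest abstention scores."""
    if len(correct) != len(scores) or len(correct) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    n_keep = max(1, int(coverage * len(correct)))
    # Answer the questions the score is least worried about; abstain on the rest.
    kept = sorted(range(len(correct)), key=lambda i: scores[i])[:n_keep]
    return sum(correct[i] for i in kept) / n_keep


# Toy data, invented for this example only.
scores = [abstention_score(c, k) for c, k in [(0.9, 0.1), (0.6, 0.9), (0.8, 0.2), (0.7, 0.8)]]
correct = [True, False, True, False]
for coverage in (0.75, 0.50, 0.25):
    print(coverage, selective_accuracy(correct, scores, coverage))
```

Lowering the coverage means answering fewer questions, so a useful abstention score should push selective accuracy up as coverage falls.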
- HealthContradict dataset tests LLMs on conflicting biomedical evidence, revealing order-dependent prediction flips of 11.4%–25.2%.
- All six open-weight LLMs changed a share of their predictions when the same correct and incorrect documents were presented in the opposite order.
- A conflict-aware abstention score combining confidence and evidence conflict detection boosted selective accuracy by up to 33.4 points in the hardest scenarios.
Why It Matters
Biomedical AI must handle contradictory evidence robustly; this study offers a benchmark for measuring that robustness and an abstention score that lets models withhold answers when the evidence conflicts.