Research & Papers

CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

Adding 'I don't know' to options actually makes LLMs choose wrong answers more often.

Deep Dive

A new research paper from Kevin Guo and colleagues introduces CLEAR (CLinical Evaluation of Ambiguity and Reliability), a framework designed to expose how noise and ambiguity degrade LLM reliability in real-world medical settings. Unlike standard exam-style benchmarks, CLEAR systematically perturbs the answer space: varying the number of plausible options, including or omitting the ground-truth and abstention choices, and changing the semantic framing of options. Applied to 17 state-of-the-art LLMs across three medical benchmarks, the framework reveals critical flaws in current evaluation methods.
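
To make the perturbation idea concrete, here is a minimal sketch of how one such answer-space manipulation could be built. The function name, parameters, and clinical options are illustrative assumptions, not the authors' code or data.

```python
import random

def perturb_answer_space(question, correct, distractors,
                         n_options=4, include_correct=True,
                         include_idk=False, seed=0):
    """Build one perturbed multiple-choice item in the spirit of CLEAR:
    vary how many plausible options appear, whether the ground-truth
    answer is present, and whether an explicit abstention choice is
    offered. Illustrative only; not the paper's implementation."""
    rng = random.Random(seed)
    n_distractors = min(n_options - int(include_correct), len(distractors))
    options = rng.sample(distractors, k=n_distractors)
    if include_correct:
        options.append(correct)
    rng.shuffle(options)
    if include_idk:
        options.append("I don't know")  # abstention option, listed last
    return {"question": question, "options": options,
            "answerable": include_correct}

# The same clinical question under an easy and a harder condition.
distractors = ["Pneumonia", "Pericarditis", "Costochondritis",
               "Panic attack", "Aortic dissection"]
easy = perturb_answer_space("Most likely diagnosis?", "Pulmonary embolism",
                            distractors, n_options=4)
hard = perturb_answer_space("Most likely diagnosis?", "Pulmonary embolism",
                            distractors, n_options=6,
                            include_correct=False, include_idk=True)
```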

Notably, increasing the number of plausible answers significantly degrades a model's ability both to pick the correct answer and to refrain from incorrect ones. Even more concerning, simply including an 'I don't know' (IDK) abstention option, which is meant to encourage caution, actually increases the rate of incorrect selections. The authors formalize this as a 'humility deficit': a growing gap between a model's ability to answer correctly and its willingness to abstain when it should. Crucially, this deficit worsens with model scale, undermining the assumption that bigger models are inherently more reliable for high-stakes fields like medicine.
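
The paper's precise metric is not reproduced here, but one plausible way to operationalize the humility deficit is as the gap between accuracy when a correct answer is present and the abstention rate when it is not. The sketch below assumes that reading; the field names are hypothetical.

```python
def humility_deficit(results):
    """Toy humility-deficit score: accuracy on answerable items minus
    the abstention rate on unanswerable items. This definition is an
    assumption for illustration, not the paper's exact formula.
    Each result dict has: 'answerable', 'predicted_correct', 'abstained'.
    """
    answerable = [r for r in results if r["answerable"]]
    unanswerable = [r for r in results if not r["answerable"]]
    accuracy = sum(r["predicted_correct"] for r in answerable) / max(len(answerable), 1)
    abstention = sum(r["abstained"] for r in unanswerable) / max(len(unanswerable), 1)
    return accuracy - abstention  # larger value = bigger humility deficit

# A model that answers well but rarely says "I don't know":
logs = [
    {"answerable": True,  "predicted_correct": True,  "abstained": False},
    {"answerable": True,  "predicted_correct": True,  "abstained": False},
    {"answerable": False, "predicted_correct": False, "abstained": False},
    {"answerable": False, "predicted_correct": False, "abstained": True},
]
print(humility_deficit(logs))  # 1.0 - 0.5 = 0.5
```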

Key Points
  • CLEAR framework systematically perturbs answer spaces (plausible options, abstention framing, semantic shifts) to test real-world medical reliability.
  • Including 'I don't know' (IDK) as an abstention option paradoxically increased incorrect answer selections across 17 LLMs.
  • The 'humility deficit'—gap between correct answers and abstention—worsens with model scale, challenging the scaling hypothesis for reliability.

Why It Matters

Current medical AI benchmarks are misleading; real-world deployment requires testing under ambiguity to avoid dangerous misdiagnoses.