MedicalBench reveals LLMs struggle with implicit medical concept extraction
New benchmark tests LLMs on subtle medical meanings hidden in patient notes—results are sobering.
MedicalBench, developed by researchers from multiple institutions, is the first systematic benchmark for implicit, evidence-grounded medical concept extraction. Unlike prior benchmarks that focus on explicitly stated medical concepts, MedicalBench targets concepts that are implied but not directly mentioned in clinical narratives. The dataset is constructed from MIMIC-IV discharge summaries paired with human-verified ICD-10 codes and is built through a multi-stage LLM triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives (conditions inferred rather than stated), semantically confusable negatives (similar-sounding but incorrect conditions), and cases where LLM judgments diverge from those of medical experts. The benchmark defines two complementary tasks: (1) concept extraction, framed as a verification task over note-concept pairs, and (2) sentence-level evidence retrieval. Together they test both correctness and interpretability.
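To make the two tasks concrete, here is a minimal sketch of how a note-concept pair and its scoring might look. The `Example` schema, the field names, and the choice of F1 as the metric are all assumptions for illustration; the summary above does not specify the benchmark's data format or official metrics.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    """One note-concept pair (hypothetical schema, not the benchmark's own)."""
    note_sentences: list[str]          # discharge summary split into sentences
    concept: str                       # e.g. an ICD-10 code description
    label: bool                        # True if the note supports the concept
    evidence: set[int] = field(default_factory=set)  # gold sentence indices


def verification_f1(gold: list[bool], pred: list[bool]) -> float:
    """Task 1 (assumed metric): F1 over yes/no verdicts on note-concept pairs."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evidence_f1(gold: set[int], pred: set[int]) -> float:
    """Task 2 (assumed metric): F1 between gold and predicted sentence indices."""
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented toy example: the diagnosis is implied by symptoms and labs,
# never named in the note itself.
toy = Example(
    note_sentences=[
        "Patient reports increased thirst and frequent urination.",
        "Fasting glucose was elevated on two occasions.",
        "Follow up with primary care in two weeks.",
    ],
    concept="Type 2 diabetes mellitus",
    label=True,
    evidence={0, 1},
)
```

Under this framing, a model is scored once on whether it accepts or rejects each note-concept pair and again on whether the sentences it cites overlap with the annotated evidence, so a correct verdict backed by the wrong sentences is still penalized.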
When state-of-the-art LLMs are evaluated on the benchmark, performance remains modest across the board, underscoring the difficulty of extracting implicitly expressed medical concepts. Importantly, performance is largely invariant to note length, which suggests the benchmark isolates reasoning difficulty rather than superficial factors such as input size. This finding highlights a fundamental limitation of current LLMs in clinical settings, where subtle or omitted details often carry critical diagnostic weight. MedicalBench provides a rigorous foundation for developing medical language models that can both identify relevant concepts and justify their predictions transparently and faithfully, a crucial step toward trustworthy healthcare AI.
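A length-invariance claim of this kind is typically checked by grouping examples by note length and comparing accuracy across groups. The sketch below shows one way to do that; the bucket edges and the `(length, is_correct)` record format are hypothetical, and this is not presented as the authors' analysis.

```python
from collections import defaultdict


def accuracy_by_length(records, edges=(500, 1000, 2000)):
    """Group (note_length, is_correct) records into length buckets.

    `edges` are assumed token-count thresholds: bucket 0 holds notes shorter
    than edges[0], bucket 1 holds notes from edges[0] up to edges[1], and so on.
    Returns a dict mapping bucket index to accuracy within that bucket.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [num_correct, num_total]
    for length, correct in records:
        bucket = sum(length >= edge for edge in edges)
        totals[bucket][0] += int(correct)
        totals[bucket][1] += 1
    return {b: c / n for b, (c, n) in sorted(totals.items())}
```

If accuracy stays roughly flat across buckets, note length is unlikely to be what drives errors, which is consistent with the reading that the difficulty lies in the inference itself.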
- First systematic benchmark for implicit, evidence-grounded medical concept extraction from EHRs, using MIMIC-IV discharge summaries and ICD-10 codes.
- Dataset includes implicit positives, semantically confusable negatives, and cases where LLMs disagree with medical expert assessments.
- State-of-the-art LLMs show modest performance that is largely invariant to note length, pointing to reasoning difficulty as the core challenge.
Why It Matters
For healthcare AI, this benchmark exposes a critical gap: LLMs can't reliably infer unstated medical conditions from narratives.