EHRBench: nearly 1M clinical questions reveal LLM gaps in diagnosis, treatment, and prognosis
Nearly 1M real-world EHR questions test 30+ LLMs on three critical clinical tasks
Researchers from multiple institutions have released EHRBench, a large-scale benchmark that evaluates LLMs on clinical decision-making using real-world electronic health records. The benchmark is constructed through an automated pipeline that converts patient encounter data into structured templates, then instantiates them into nearly 1 million (960,067) question-answer pairs across three core tasks: diagnosis, treatment, and prognosis. To ensure reliability, the pipeline incorporates systematic knowledge-base verification to filter out hallucinated or ambiguous relations. This design balances scalability with the need for clinically meaningful evaluation.
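The template-then-verify idea described above can be illustrated with a minimal Python sketch. It is not the EHRBench pipeline: the `Encounter` fields, the `KNOWLEDGE_BASE` triples, the relation names (`indicates`, `treated_with`), the template wording, and the rule "keep an item if at least one supporting relation is verified" are all hypothetical choices made for illustration, and the prognosis task is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Encounter:
    """Hypothetical, heavily simplified view of one patient encounter."""
    findings: tuple
    diagnosis: str
    treatment: str


# Hypothetical knowledge base of verified (head, relation, tail) triples.
KNOWLEDGE_BASE = {
    ("elevated HbA1c", "indicates", "type 2 diabetes"),
    ("type 2 diabetes", "treated_with", "metformin"),
}

# Hypothetical question templates, one per task.
TEMPLATES = {
    "diagnosis": "A patient presents with {context}. What is the most likely diagnosis?",
    "treatment": "A patient is diagnosed with {context}. Which treatment is appropriate?",
}


def candidate_items(enc):
    """Yield (task, context, answer, supporting relations) for one encounter."""
    yield (
        "diagnosis",
        ", ".join(enc.findings),
        enc.diagnosis,
        [(finding, "indicates", enc.diagnosis) for finding in enc.findings],
    )
    yield (
        "treatment",
        enc.diagnosis,
        enc.treatment,
        [(enc.diagnosis, "treated_with", enc.treatment)],
    )


def build_items(enc):
    """Instantiate templates, keeping only items backed by the knowledge base."""
    items = []
    for task, context, answer, relations in candidate_items(enc):
        # Assumed filter rule: at least one supporting relation must be verified.
        if any(relation in KNOWLEDGE_BASE for relation in relations):
            items.append({
                "task": task,
                "question": TEMPLATES[task].format(context=context),
                "answer": answer,
            })
    return items
```

In this sketch, an encounter whose recorded treatment has no matching `treated_with` triple yields no treatment question, which is the kind of filtering that keeps unverified relations out of the final question set.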
EHRBench was used to benchmark more than 30 representative LLMs, and the results show consistent trends in model capability across all three tasks. While LLMs perform well on straightforward cases, they struggle with complex multimorbidity scenarios and with tasks that require nuanced clinical inference. These findings point to concrete gaps to address before LLMs can be relied on in real-world clinical settings. The work has been accepted at KDD 2026 and is available via arXiv, giving future research on AI-assisted clinical decision-making a standardized evaluation tool.
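To make the per-task comparison concrete, here is a minimal sketch of how model answers could be scored separately for each task. The summary does not state which metric EHRBench uses, so the case-insensitive exact-match accuracy below is an assumption, as are the item format and the function name `accuracy_by_task`.

```python
from collections import defaultdict


def accuracy_by_task(items, predictions):
    """Return exact-match accuracy per task (assumed metric, for illustration).

    items: list of dicts with "task" and "answer" keys (hypothetical format).
    predictions: list of model answer strings, aligned with items.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, prediction in zip(items, predictions):
        task = item["task"]
        total[task] += 1
        # Case-insensitive comparison after trimming surrounding whitespace.
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}
```

Reporting one score per task, rather than a single pooled number, is what makes it possible to see whether a model is weaker on, say, prognosis than on diagnosis.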
- 960,067 QA items generated automatically from EHR trajectories by an LLM-based template pipeline with knowledge-base verification
- Covers three inference-heavy clinical tasks: diagnosis, treatment, and prognosis
- 30+ LLMs tested; consistent capability gaps found in complex multi-condition scenarios
Why It Matters
EHRBench measures LLM performance on realistic clinical data, a prerequisite for safe and reliable AI deployment in healthcare.