ClinicalBench stress-tests assertion-aware retrieval, boosting clinical QA by 22pp
Physician review for the new benchmark finds 56% of auto-generated reference answers defective.
A new paper from Alex Stinard introduces ClinicalBench, a rigorous benchmark designed to test how well retrieval systems handle the messy reality of electronic health records (EHRs). Unlike traditional clinical QA tasks that assume clean inputs, ClinicalBench focuses on the step before reasoning: finding the right facts in real clinical notes where negation, temporality, and patient-versus-family attribution can flip a correct answer into a dangerous error. The benchmark comprises 400 questions across 43 MIMIC-IV patients, covering nine assertion-sensitive categories such as allergy histories, medication timelines, and family medical conditions.
The key innovation is EpiKG, a retrieval architecture that builds patient knowledge graphs with assertion labels and temporality tags, then routes retrieval based on question intent. Compared to a dense-RAG baseline (Contriever), EpiKG achieved an improvement of 8.84 percentage points (pp; p=1.79e-3) on the primary change-excluded endpoint, rising to 12.43 pp under oracle intent. The most striking result came from physician adjudication: three clinicians blindly evaluated 100 paired items and found that 56% of the auto-generated reference answers were defective. This methodological finding suggests that current clinical NLP benchmarks may be unreliable without human validation. The paper also reports a negative correlation (r=-0.921, p=0.009) between LLM baseline performance and the gain from EpiKG: the weaker the baseline model, the larger the gain. The authors caution that this may reflect regression to the mean rather than evidence that assertion encoding can substitute for model size. ClinicalBench, the frozen evaluator, and the EpiKG output stack are publicly released.
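To make the idea of assertion-aware, intent-routed retrieval concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Fact` record, the `Assertion` and `Temporality` enums, the intent names in `INTENT_FILTERS`, and the example notes are all hypothetical, chosen only to show why a retriever that ignores negation or family attribution can surface the wrong fact.

```python
from dataclasses import dataclass
from enum import Enum


class Assertion(Enum):
    PRESENT = "present"   # affirmed for the patient
    NEGATED = "negated"   # explicitly denied or ruled out
    FAMILY = "family"     # attributed to a relative, not the patient


class Temporality(Enum):
    CURRENT = "current"
    HISTORICAL = "historical"


@dataclass(frozen=True)
class Fact:
    """One node of a toy patient graph (hypothetical schema)."""
    concept: str
    assertion: Assertion
    temporality: Temporality
    source_text: str


# Hypothetical question intents, each mapped to a filter over facts.
INTENT_FILTERS = {
    "current_condition": lambda f: (
        f.assertion is Assertion.PRESENT and f.temporality is Temporality.CURRENT
    ),
    "past_history": lambda f: (
        f.assertion is Assertion.PRESENT and f.temporality is Temporality.HISTORICAL
    ),
    "family_history": lambda f: f.assertion is Assertion.FAMILY,
    "ruled_out": lambda f: f.assertion is Assertion.NEGATED,
}


def retrieve(facts, concept, intent):
    """Return facts about `concept` whose labels match the question intent."""
    keep = INTENT_FILTERS[intent]
    return [f for f in facts if f.concept == concept and keep(f)]


# Invented example notes, not drawn from MIMIC-IV.
facts = [
    Fact("diabetes", Assertion.FAMILY, Temporality.HISTORICAL,
         "Mother had type 2 diabetes."),
    Fact("diabetes", Assertion.NEGATED, Temporality.CURRENT,
         "Patient denies diabetes."),
    Fact("penicillin allergy", Assertion.PRESENT, Temporality.CURRENT,
         "Allergic to penicillin (rash)."),
    Fact("pneumonia", Assertion.PRESENT, Temporality.HISTORICAL,
         "Pneumonia in 2015, resolved."),
]

# "Does the patient have diabetes?" -> no affirmed, current fact is returned.
patient_diabetes = retrieve(facts, "diabetes", "current_condition")

# "Is there a family history of diabetes?" -> only the mother's record.
family_diabetes = retrieve(facts, "diabetes", "family_history")
```

A keyword or purely similarity-based retriever would likely return both diabetes notes for either question, leaving the model to untangle who has the condition. Filtering on labels before generation is the kind of step the article attributes to EpiKG, although the actual system may differ substantially from this sketch.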
- EpiKG uses intent-aware KG-RAG with assertion labels and temporality tags, improving over dense-RAG by +8.84 pp (p=1.79e-3)
- Across 6 LLMs, the primary endpoint showed a +22 pp gain (p=0.0192) on 50 unanimous-strict items adjudicated by external physicians
- 56% of auto-generated reference answers were found defective by physician review, exposing flaws in current clinical QA benchmarks
Why It Matters
Retrieval errors over EHR notes can turn a correct answer into a dangerous one. ClinicalBench exposes defects in auto-generated reference answers and makes the case for physician-adjudicated evaluation of clinical AI.