LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

A new contamination-free test reveals AI's shocking inability to handle real clinical cases.

Deep Dive

Researchers introduced LiveMedBench, a contamination-free medical benchmark that tests LLMs on 2,756 real-world clinical cases and scores their answers with an automated rubric. The results show a critical performance gap: the best model scored only 39.2%. They also point to pervasive data contamination, with 84% of models performing worse on newer cases. The main failure point, accounting for 35-48% of errors, is applying knowledge to a specific patient's context rather than a lack of facts.
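
To make the "worse on newer cases" figure concrete, here is a minimal sketch of how such a share could be tallied. The function name, model names, and scores are hypothetical and invented for illustration; the summary does not say how LiveMedBench splits cases by date or computes this statistic, so treat this as one plausible reading rather than the benchmark's method.

```python
from typing import Dict, Tuple


def share_worse_on_newer(scores: Dict[str, Tuple[float, float]]) -> float:
    """Fraction of models whose score on newer cases is lower than on older cases.

    `scores` maps a model name to (older_case_score, newer_case_score).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    worse = sum(1 for older, newer in scores.values() if newer < older)
    return worse / len(scores)


# Hypothetical, made-up numbers purely for illustration (not benchmark results).
example_scores = {
    "model_a": (0.41, 0.35),
    "model_b": (0.30, 0.32),
    "model_c": (0.28, 0.22),
    "model_d": (0.36, 0.31),
}

fraction = share_worse_on_newer(example_scores)
print(f"{fraction:.0%} of models scored lower on newer cases")
```

A drop on cases published after a model's training cutoff is consistent with earlier scores having been inflated by memorized material, which is why a date-based comparison like this is a common contamination check.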

Why It Matters

These results expose a major reliability problem for AI in high-stakes healthcare, where benchmark scores inflated by contamination can hide dangerous shortcomings.