
New Medical Benchmark Exposes LLM Flaws: Top Model Scores Only 39.2%

A new contamination-free test shows that leading LLMs struggle with real clinical cases.

Deep Dive

Researchers have introduced LiveMedBench, a medical benchmark built from 2,756 real-world clinical cases and designed to be free of training-data contamination. Responses are graded with an automated rubric. The results point to pervasive data contamination, with 84% of models performing worse on newer cases, and to a critical performance gap: the best model scored only 39.2%. The most common failure (35-48% of errors) is not a lack of facts but difficulty applying knowledge to a specific patient's context.
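To make the "worse on newer cases" figure concrete, here is a minimal sketch of how such a share could be computed. It assumes each model has per-case rubric scores between 0 and 1, grouped into older and newer cases. The model names, the scores, and the `share_worse_on_newer` helper are all made up for illustration; none of them come from LiveMedBench.

```python
from statistics import mean

# Hypothetical per-case rubric scores (0 to 1). Toy numbers only,
# NOT LiveMedBench data.
scores = {
    "model_a": {"older": [0.50, 0.40, 0.60], "newer": [0.30, 0.35, 0.25]},
    "model_b": {"older": [0.20, 0.30], "newer": [0.30, 0.40]},
    "model_c": {"older": [0.45, 0.55], "newer": [0.40, 0.30]},
}


def share_worse_on_newer(scores):
    """Return the fraction of models whose mean score on newer cases is
    lower than on older cases, plus the names of those models."""
    worse = [
        name
        for name, split in scores.items()
        if mean(split["newer"]) < mean(split["older"])
    ]
    return len(worse) / len(scores), sorted(worse)


share, worse_models = share_worse_on_newer(scores)
print(f"{share:.0%} of models score lower on newer cases: {worse_models}")
```

A consistent drop on newer cases is one signal that older cases may have leaked into training data, though a drop could also reflect newer cases simply being harder.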

Why It Matters

This points to a serious reliability problem for AI in high-stakes healthcare, where inflated benchmark scores can hide dangerous shortcomings.
