LiveClin: A Live Clinical Benchmark without Leakage
The best of 26 tested models reaches a Case Accuracy of just 35.7% on a new benchmark built from 1,407 contemporary case reports.
Researchers led by Xidong Wang introduced LiveClin, a live clinical benchmark designed to prevent data contamination. Built from 1,407 contemporary case reports and 6,605 questions, it is updated biannually and has been validated by 239 physicians. Across the 26 models tested, the highest Case Accuracy was only 35.7%, while human Chief Physicians outperformed most of the models. The benchmark offers a continuously evolving framework to guide medical LLM development toward real-world clinical utility.
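To make the headline metric concrete, here is a minimal sketch of how a case-level score can differ from a question-level one. It rests on an assumption this summary does not state: that a case counts as correct only when every one of its questions is answered correctly. The function names, the `(case_id, is_correct)` record layout, and the toy data are all hypothetical illustrations, not LiveClin's actual evaluation code.

```python
from collections import defaultdict


def question_accuracy(results):
    """Fraction of individual questions answered correctly."""
    results = list(results)
    if not results:
        return 0.0
    return sum(bool(ok) for _, ok in results) / len(results)


def case_accuracy(results):
    """Fraction of cases in which every question was answered correctly.

    ASSUMPTION: the all-or-nothing rule per case is illustrative only;
    the summary above does not define Case Accuracy.
    """
    per_case = defaultdict(list)
    for case_id, ok in results:
        per_case[case_id].append(bool(ok))
    if not per_case:
        return 0.0
    solved = sum(all(answers) for answers in per_case.values())
    return solved / len(per_case)


# Hypothetical toy data: (case_id, is_correct), one entry per question.
toy = [
    ("case-1", True), ("case-1", True), ("case-1", False),
    ("case-2", True), ("case-2", True),
]

print(question_accuracy(toy))  # 4 of 5 questions correct
print(case_accuracy(toy))      # 1 of 2 cases fully correct
```

Under this assumed definition, a single wrong answer sinks the whole case. With 6,605 questions spread over 1,407 cases (about 4.7 questions per case), a case-level score could sit well below a model's per-question accuracy.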
Why It Matters
LiveClin exposes the gap between AI performance on static tests and real-world medical reasoning, and it can guide development toward clinical reliability.