Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI
New benchmark uses D.O.T.S. metric to evaluate AI's clinical reasoning in multi-step patient interactions.
A research team including Anna Kozlova, Stanislau Salavei, and three others has published Doctorina MedBench, a comprehensive new framework designed to evaluate the clinical competence of AI agents. Unlike traditional medical benchmarks that present static multiple-choice questions, Doctorina MedBench simulates end-to-end clinical encounters. In these simulations, an AI agent must engage in a multi-step dialogue to collect a patient's medical history, analyze attached lab reports and images, formulate a differential diagnosis, and provide personalized treatment recommendations. This approach aims to mirror the dynamic, information-gathering nature of real-world medicine.
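The encounter flow described above (gather history via dialogue, review attached documents, then conclude with a diagnosis and plan) can be sketched as a simple driver loop. This is an illustrative sketch only: `SimulatedPatient`, `ScriptedAgent`, and `run_encounter` are hypothetical names, not the benchmark's published API.

```python
# Hypothetical sketch of an end-to-end encounter loop like the one
# Doctorina MedBench simulates. All class and method names here are
# illustrative assumptions, not the benchmark's actual interface.

class SimulatedPatient:
    """Answers history questions and hands over attached documents."""
    def __init__(self, case):
        self.case = case

    def answer(self, question: str) -> str:
        return self.case["history"].get(question, "I'm not sure.")

    def documents(self):
        return self.case.get("attachments", [])  # lab reports, images


class ScriptedAgent:
    """Trivial stand-in agent that asks a fixed list of questions."""
    def __init__(self, questions):
        self._questions = list(questions)
        self.notes = []

    def review(self, doc):
        self.notes.append(("document", doc))

    def next_question(self):
        # Returning None signals the agent is ready to conclude.
        return self._questions.pop(0) if self._questions else None

    def record(self, question, answer):
        self.notes.append((question, answer))

    def differential_diagnosis(self):
        return ["viral pharyngitis"]  # placeholder conclusion

    def treatment_plan(self):
        return "supportive care"     # placeholder plan


def run_encounter(agent, case, max_steps: int = 20):
    """Drive one multi-step encounter and collect the agent's output."""
    patient = SimulatedPatient(case)
    for doc in patient.documents():
        agent.review(doc)                 # analyze labs/images up front
    steps = 0
    while steps < max_steps:
        question = agent.next_question()  # information gathering
        if question is None:
            break
        agent.record(question, patient.answer(question))
        steps += 1
    return {
        "differential": agent.differential_diagnosis(),
        "treatment": agent.treatment_plan(),
        "steps": steps,
    }


case = {
    "history": {"Any fever?": "Yes, 38.5 C for two days."},
    "attachments": ["cbc_report.pdf"],
}
agent = ScriptedAgent(["Any fever?", "Any cough?"])
result = run_encounter(agent, case)
```

The driver records both the agent's conclusions and the number of dialogue turns used, which is exactly the information a step-count-aware metric needs.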
The framework's core innovation is the D.O.T.S. evaluation metric, which scores agents across four components: the accuracy of their Diagnosis, the relevance of their Observations and requested investigations, the appropriateness of their Treatment plan, and the efficiency of their dialogue Step count. This lets developers assess not just whether an AI reaches the right answer, but how it gets there. The benchmark currently contains over 1,000 diverse clinical cases spanning more than 750 diagnoses and includes safety-oriented 'trap cases' designed to catch critical errors. The authors argue that this simulation-based method provides a more realistic and rigorous assessment of an AI's clinical reasoning and safety than exam-style tests, potentially benefiting both AI development and medical education.
- Simulates realistic clinical dialogues where AI must gather history and analyze documents, moving beyond simple Q&A.
- Uses the D.O.T.S. metric (Diagnosis, Observations, Treatment, Step Count) to evaluate both correctness and efficiency.
- Contains a dataset of 1,000+ clinical cases covering 750+ diagnoses, with built-in safety traps and regression testing.
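A metric like D.O.T.S. could be computed as a composite of per-component scores. The sketch below assumes equal weighting and simple definitions for each component (diagnosis correctness, precision of requested observations, a treatment score in [0, 1], and a step-budget efficiency term); the paper's actual formula may differ.

```python
from dataclasses import dataclass

# Hypothetical D.O.T.S.-style composite score. The component
# definitions and equal weighting are assumptions for illustration,
# not the benchmark's published scoring rule.

@dataclass
class EncounterResult:
    diagnosis_correct: bool      # D: was the final diagnosis right?
    relevant_observations: int   # O: relevant observations/investigations requested
    total_observations: int      # O: all observations/investigations requested
    treatment_score: float       # T: appropriateness of the plan, in [0, 1]
    steps_taken: int             # S: dialogue turns used
    step_budget: int             # S: turns allowed before efficiency penalties

def dots_score(r: EncounterResult) -> float:
    """Average of four per-component scores, each in [0, 1]."""
    d = 1.0 if r.diagnosis_correct else 0.0
    # Precision of the requested workup: relevant / total requested.
    o = r.relevant_observations / r.total_observations if r.total_observations else 0.0
    t = r.treatment_score
    # Efficiency: full credit within budget, scaled down for overruns.
    s = min(1.0, r.step_budget / r.steps_taken) if r.steps_taken else 0.0
    return (d + o + t + s) / 4.0

result = EncounterResult(
    diagnosis_correct=True,
    relevant_observations=4, total_observations=5,
    treatment_score=0.9,
    steps_taken=12, step_budget=10,
)
score = dots_score(result)
```

A scheme along these lines rewards an agent that asks only pertinent questions and stays within its turn budget, capturing the "how it gets there" aspect alongside final-answer correctness.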
Why It Matters
Provides a crucial, realistic safety check for medical AI agents before deployment, ensuring they can reason through complex patient interactions.