Google DeepMind's AI Co-Clinician Outperforms GPT-5.4 in Doctor Tests But Still Trails Experienced Physicians
Doctors preferred DeepMind's AI over GPT-5.4 by 63-30 in blind evaluations.
In a blind primary-care evaluation published this week, Google DeepMind's AI co-clinician outperformed OpenAI's GPT-5.4-thinking-with-search by a clear margin: expert physicians, rating answers without knowing which system produced them, preferred DeepMind's responses 63-30 across 98 realistic queries and also favored it over an existing clinical AI tool 67-26. Yet the AI still trailed experienced human physicians in two critical areas: catching red-flag warning signs and guiding patients through hands-on physical examinations. DeepMind researcher Alan Karthikesalingam emphasized the early stage of the work, stating, “While it’s early days, the promise is clear.” The system is built on a triadic-care model in which the AI supports treatment but physicians retain authority, a framing reinforced by one serious safety error logged during the evaluation.
Beyond the preference splits, DeepMind's system showed notable strengths in medication reasoning. On the RxQA benchmark, a set of 600 questions on active ingredients, drug interactions, and dosages vetted by licensed pharmacists, the AI co-clinician scored 73.3% on multiple-choice questions, narrowly edging GPT-5.4's 72.7%. On open-ended medication answers, which require nuanced dosing context and patient-specific cautions, the gap widened: 95.0% answer quality versus 90.9% for OpenAI's model. Primary-care physicians using reference books scored 61.3% on the multiple-choice questions, trailing both AIs. DeepMind positions the tool as a support system for doctors, especially for routine guidance and structured information gathering, while acknowledging the need for human oversight in high-risk triage moments.
- Doctors preferred DeepMind's AI co-clinician over GPT-5.4-thinking-with-search by 63-30 in a blind test of 98 primary-care queries.
- Experienced physicians still outperformed the AI on red-flag recognition and physical-exam guidance, highlighting a critical gap.
- DeepMind's system scored 95.0% on open-ended medication answers (vs GPT-5.4's 90.9%) on the RxQA benchmark, demonstrating strong clinical nuance.
Why It Matters
AI assistants may soon bolster routine primary care, but they cannot yet replace doctors on high-risk triage decisions.