Research & Papers

Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI

Forced-choice tests produced a 51.6% failure rate, but natural chat revealed 100% accuracy for some emergencies.

Deep Dive

A new study from researchers David Fraile Navarro, Farah Magrabi, and Enrico Coiera challenges the alarming conclusion of a recent Nature Medicine paper that ChatGPT Health under-triaged 51.6% of emergencies. The researchers argue the original evaluation used an unrealistic, exam-style protocol that forced models to choose from A/B/C/D options, suppressed their medical knowledge, and prevented them from asking clarifying questions—conditions that don't reflect how people actually use health chatbots.

To test this claim, the team conducted a partial replication using five frontier LLMs: OpenAI's GPT-5.2, Anthropic's Claude Sonnet 4.6 and Claude Opus 4.6, and Google's Gemini 3 Flash and Gemini 3.1 Pro. They tested the models under both the original constrained format (1,275 trials) and a naturalistic, patient-style messaging format (850 trials). The results were stark: naturalistic interaction improved overall triage accuracy by 6.4 percentage points (p=0.015), and for asthma scenarios triage accuracy jumped from 48% to 80%. Crucially, the forced-choice format was identified as the dominant failure mechanism. Three models scored between 0% and 24% under forced choice but achieved 100% accuracy with free-text responses, consistently recommending emergency care in their own words while the test format registered their answers as failures.
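The mechanism the authors describe can be illustrated with a hypothetical scoring sketch (the function names, regex, and keyword list below are illustrative assumptions, not taken from either study): a strict exam-style grader accepts only a bare option letter, so a model that answers in its own words with a correct emergency recommendation is logged as a failure, while a naturalistic grader that checks the content of the reply scores it correct.

```python
import re

def grade_forced_choice(response: str, correct_letter: str) -> bool:
    """Strict exam-style grading: accept only a lone option letter (A-D)."""
    match = re.fullmatch(r"\s*([A-D])[.)]?\s*", response)
    return bool(match) and match.group(1) == correct_letter

def grade_naturalistic(response: str) -> bool:
    """Naturalistic grading: look for an emergency-care recommendation
    anywhere in the free-text reply (keyword list is illustrative)."""
    keywords = ("call 911", "emergency", "go to the er")
    return any(k in response.lower() for k in keywords)

# A model replying in its own words rather than with a bare letter:
reply = "This looks like diabetic ketoacidosis; call 911 or go to the ER now."

grade_forced_choice(reply, "A")  # scored as an under-triage failure
grade_naturalistic(reply)        # the emergency advice is recognized
```

Under this kind of scoring, the same correct clinical judgment is counted as a failure or a success depending purely on the grader, which is the contingency on evaluation design the replication highlights.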

The study's key finding is that the headline under-triage rate is not a stable measure of model capability but is highly contingent on evaluation design. For example, diabetic ketoacidosis was correctly triaged in 100% of trials across all models and conditions. The authors conclude that valid evaluation of consumer health AI requires testing under conditions that reflect actual use, not artificial academic constraints. This research has significant implications for how AI safety and efficacy are measured before public deployment.

Key Points
  • Forced-choice test formats caused dramatic failure rates (0-24%) that vanished with free-text responses, where three models scored 100%.
  • Naturalistic, chat-based testing improved overall triage accuracy by 6.4 percentage points and boosted asthma scenario accuracy from 48% to 80%.
  • The study tested five frontier models (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, and Gemini 3.1 Pro) across 2,125 trials, showing that format, not capability, drove the prior failures.

Why It Matters

How we test AI safety is as critical as the technology itself; flawed benchmarks could block beneficial tools or deploy unsafe ones.