Research & Papers

HealthCraft exposes AI's near-zero performance on multi-step emergency medicine tasks

Claude Opus 4.6 and GPT-5.4 score 1.0% and 0.0%, respectively, on realistic multi-step clinical workflows

Deep Dive

HealthCraft, introduced by researcher Brandon Dent, is the first public reinforcement-learning environment designed to evaluate trajectory-level safety of AI models in emergency medicine. Built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, it exposes 24 MCP tools and defines a dual-layer rubric that zeroes reward whenever any safety-critical criterion is violated. The environment includes 195 tasks across six categories, graded against 2,255 binary criteria, of which 515 are safety-critical. A post-hoc negative-class slate extends this to 205 tasks and 2,337 criteria.
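
To make the safety gate concrete, here is a minimal Python sketch of a rubric that zeroes reward on any safety-critical violation. It is an illustration, not the HealthCraft harness: the `Criterion` class, the `trajectory_reward` function, the example criterion names, and the choice to score non-violating trajectories as the fraction of criteria met are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Criterion:
    """One binary rubric criterion (hypothetical structure)."""
    name: str
    passed: bool
    safety_critical: bool = False


def trajectory_reward(criteria: Sequence[Criterion]) -> float:
    """Return a reward in [0, 1] for one graded trajectory.

    Layer 1 (safety gate): any failed safety-critical criterion zeroes the reward.
    Layer 2 (assumed): otherwise, reward is the fraction of criteria passed.
    """
    if not criteria:
        return 0.0
    if any(c.safety_critical and not c.passed for c in criteria):
        return 0.0
    return sum(c.passed for c in criteria) / len(criteria)


# Hypothetical trajectory: strong on ordinary criteria, one safety violation.
graded = [
    Criterion("documented_triage_level", passed=True),
    Criterion("ordered_ecg", passed=True),
    Criterion("checked_allergies_before_dosing", passed=False, safety_critical=True),
]
print(trajectory_reward(graded))  # 0.0, because the safety gate overrides partial credit
```

The point of the gate is that partial credit cannot offset a dangerous action: a trajectory that meets most criteria still earns nothing if it violates a single safety-critical one.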

Results on two frontier models are sobering: Claude Opus 4.6 achieved Pass@1 of 24.8% [21.5-28.4] and GPT-5.4 reached 12.6% [10.2-15.6], with safety-failure rates of 27.5% and 34.0%, respectively. On multi-step workflows, the closest proxy to real emergency care, performance collapsed to near zero (Claude Opus 4.6 1.0%, GPT-5.4 0.0%) despite partial competence on individual steps. The paper also notes that six infrastructure bugs fixed between pilots v2 and v8 re-ordered which model appeared stronger, underscoring that infrastructure fidelity is part of the measurement. The environment, tasks, rubrics, and harness are released under Apache 2.0.
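
One way to see how partial per-step competence can coexist with near-zero workflow success is simple compounding. The sketch below is a back-of-the-envelope illustration, not an analysis from the paper: it assumes every step must succeed and that steps succeed independently, and the 60% per-step rate and 8-step length are made-up numbers chosen only to show the effect. The function name `workflow_success_probability` is likewise hypothetical.

```python
def workflow_success_probability(step_success: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps.

    This is an illustrative model only; real clinical steps are
    correlated and the paper does not report per-step rates.
    """
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return step_success ** n_steps


# Hypothetical inputs: 60% success per step, 8 steps that must all succeed.
p = workflow_success_probability(0.6, 8)
print(f"{p:.1%}")  # 1.7%
```

Under these assumptions, a model that gets each step right more often than not still completes the full workflow less than 2% of the time, which is the same order of magnitude as the multi-step results reported above.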

Key Points
  • HealthCraft simulates emergency medicine with 3,987 seed entities and 24 MCP tools, and grades 195 tasks against 2,255 binary criteria (515 safety-critical).
  • Claude Opus 4.6 scored Pass@1 24.8% with 27.5% safety failures; GPT-5.4 scored 12.6% with 34.0% safety failures.
  • Multi-step workflow performance collapsed to 1.0% (Claude Opus 4.6) and 0.0% (GPT-5.4), despite partial success on individual steps.

Why It Matters

Static benchmarks can miss catastrophic failures that only appear across a full trajectory; HealthCraft's results suggest that even frontier models are not yet safe for real emergency-medicine deployment.