HealthCraft exposes AI's near-zero performance on multi-step emergency medicine tasks

arXiv cs.LG May 23, 2026

⚡ Claude Opus 4.6 and GPT-5.4 score 1% or less on realistic multi-step clinical workflows

Deep Dive

HealthCraft, introduced by researcher Brandon Dent, is the first public reinforcement-learning environment designed to evaluate trajectory-level safety of AI models in emergency medicine. Built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, it exposes 24 MCP tools and defines a dual-layer rubric that zeroes reward whenever any safety-critical criterion is violated. The environment includes 195 tasks across six categories, graded against 2,255 binary criteria, of which 515 are safety-critical. A post-hoc negative-class slate extends this to 205 tasks and 2,337 criteria.
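The safety gate described above can be illustrated with a minimal sketch. It is not HealthCraft's actual grader: the `CriterionResult` type and `trajectory_reward` function are hypothetical names, and scoring the base reward as the fraction of criteria met is an assumption made only for illustration. The one behavior taken from the description is that any violated safety-critical criterion zeroes the reward.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CriterionResult:
    """Outcome of one binary rubric criterion (hypothetical structure)."""
    criterion_id: str
    passed: bool
    safety_critical: bool = False


def trajectory_reward(results: Sequence[CriterionResult]) -> float:
    """Return a reward in [0, 1] for one graded trajectory.

    Layer 1 (safety gate): any failed safety-critical criterion zeroes the reward.
    Layer 2 (assumed): otherwise, the reward is the fraction of criteria met.
    """
    if not results:
        return 0.0
    if any(r.safety_critical and not r.passed for r in results):
        return 0.0
    return sum(r.passed for r in results) / len(results)


# Two of three criteria met, no safety violation: partial credit.
partial = [
    CriterionResult("ordered_ecg", True),
    CriterionResult("documented_allergies", False),
    CriterionResult("checked_contraindication", True, safety_critical=True),
]

# Same trajectory, but the safety-critical check failed: reward is zeroed.
unsafe = [
    CriterionResult("ordered_ecg", True),
    CriterionResult("documented_allergies", True),
    CriterionResult("checked_contraindication", False, safety_critical=True),
]

print(trajectory_reward(partial))
print(trajectory_reward(unsafe))
```

Under this kind of gating, a trajectory that meets most criteria but commits a single safety violation scores the same as one that meets none, which is why partial competence on individual steps does not translate into reward.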

Results on two frontier models are sobering: Claude Opus 4.6 achieved Pass@1 of 24.8% [21.5-28.4] and GPT-5.4 reached 12.6% [10.2-15.6], with safety-failure rates of 27.5% and 34.0% respectively. On multi-step workflows, the closest proxy to real emergency care, performance collapsed to near zero (Claude 1.0%, GPT-5.4 0.0%), despite partial competence on individual steps. The paper also notes that six infrastructure bugs fixed between pilots v2 and v8 re-ordered which model appeared stronger, underscoring that infrastructure fidelity is part of the measurement. The environment, tasks, rubrics, and harness are released under Apache 2.0.
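For readers who want to see how headline numbers of this kind are typically derived, the sketch below computes Pass@1 and a safety-failure rate from per-trial records and attaches an interval to a proportion. The `Trial` record and helper names are hypothetical, and the Wilson score interval is an assumption chosen for illustration; the summary does not state which interval method or confidence level the paper uses.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Trial:
    """One single-attempt run of a task (hypothetical record layout)."""
    task_id: str
    passed: bool
    safety_failure: bool


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)


# Toy data, not taken from the paper.
trials = [
    Trial("triage-001", passed=True, safety_failure=False),
    Trial("triage-002", passed=False, safety_failure=True),
    Trial("workflow-001", passed=False, safety_failure=False),
    Trial("workflow-002", passed=False, safety_failure=True),
]

pass_at_1 = rate(t.passed for t in trials)
safety_failure_rate = rate(t.safety_failure for t in trials)
low, high = wilson_interval(sum(t.passed for t in trials), len(trials))

print(f"Pass@1 = {pass_at_1:.1%} [{low:.1%}, {high:.1%}]")
print(f"Safety-failure rate = {safety_failure_rate:.1%}")
```

With single attempts per trial, Pass@1 reduces to the fraction of trials that pass, so the two rates can move independently: a model can fail a task without committing a safety violation, and the safety-failure rate can exceed the pass rate, as it does for both models reported here.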

Key Points

HealthCraft uses 3,987 seed entities, 24 MCP tools, and 2,255 binary criteria (515 safety-critical) to simulate emergency medicine.
Claude Opus 4.6 scored Pass@1 24.8% with 27.5% safety failures; GPT-5.4 scored 12.6% with 34.0% safety failures.
Multi-step workflow performance collapsed to 1% (Claude) and 0% (GPT-5.4), despite partial success on individual steps.

Why It Matters

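Static benchmarks can miss trajectory-level safety failures. HealthCraft's results, with safety-failure rates of 27.5% and 34.0% and near-zero success on multi-step workflows, suggest that even frontier models are not yet safe for real clinical deployment.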
