Research & Papers

CHI-Bench: Best AI agent fails 72% of complex healthcare workflow tasks

Top AI models struggle with policy-rich, multi-role healthcare tasks: the best configuration passes only 28% of them.

Deep Dive

CHI-Bench, released by a team of 33 researchers from institutions including Carnegie Mellon, University of Chicago, and Salesforce AI, is the first benchmark explicitly designed to stress-test AI agents on end-to-end healthcare workflows that require policy adherence, multi-role coordination, and multilateral dialog. The benchmark covers three domains: provider prior authorization, payer utilization management, and care management. Each task places an agent in a high-fidelity simulator of 20 healthcare applications connected through 87 MCP (Model Context Protocol) tools. Agents must navigate a managed-care operations handbook of more than 1,290 documents to make decisions, write artifacts, and engage in peer reviews or patient outreach, all while playing multiple roles and handing work off between them.

The results are sobering. Among 30 different agent harness and model configurations tested, the best-performing agent achieved only a 28.0% pass rate on individual tasks. No agent surpassed 20% on the strict pass@3 metric, which requires a task to succeed in all three independent runs. When researchers forced agents to execute all tasks in a single continuous session, mimicking real-world workflow batching, the success rate plummeted to just 3.8%. This drop suggests that current AI systems lack the long-horizon planning, memory, and error recovery needed for complex enterprise operations.
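To make the difference between the two headline metrics concrete, here is a minimal Python sketch. It assumes the strict pass@3 score counts a task as solved only when all three independent runs pass, as described above; the benchmark's own scoring code may differ. The function names, task IDs, and outcomes are hypothetical and invented for illustration.

```python
from typing import Dict, List


def pass_rate(results: Dict[str, List[bool]]) -> float:
    """Share of all individual runs that passed, pooled across tasks."""
    runs = [passed for trials in results.values() for passed in trials]
    if not runs:
        raise ValueError("no runs recorded")
    return sum(runs) / len(runs)


def strict_pass_at_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Share of tasks whose first k runs ALL passed (assumed definition)."""
    if not results:
        raise ValueError("no tasks recorded")
    for task_id, trials in results.items():
        if len(trials) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} runs")
    solved = sum(all(trials[:k]) for trials in results.values())
    return solved / len(results)


# Hypothetical task IDs and outcomes, invented for illustration only.
example_results = {
    "prior_auth_001": [True, True, True],
    "util_mgmt_002": [True, False, True],
    "care_mgmt_003": [False, False, False],
    "prior_auth_004": [True, True, False],
}

print(f"pass rate:     {pass_rate(example_results):.3f}")         # 7 of 12 runs
print(f"strict pass@3: {strict_pass_at_k(example_results):.3f}")  # 1 of 4 tasks
```

In this toy data the agent passes 7 of 12 runs but only 1 of 4 tasks on every attempt, which shows how an agent that succeeds intermittently can post a respectable per-run pass rate and still score poorly on the strict metric.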

The authors hypothesize that similar performance gaps will surface in other enterprise domains that combine policy density (hundreds of rules), role composition (multiple identities), and irreversible steps (decisions that cannot be undone). CHI-Bench is publicly available with code and dataset on GitHub, inviting further research into agent architectures that can handle the complexity of real-world business workflows. The benchmark's design could spur development of better multi-agent coordination, policy-grounded reasoning, and robust task execution frameworks.

Key Points
  • The best of 30 agent harness and model configurations passed only 28.0% of tasks; strict pass@3 stayed below 20% for every agent.
  • Single-session execution (mimicking real workflow batching) caused performance to plummet to 3.8%.
  • Benchmark uses 87 MCP tools across 20 simulated healthcare apps, guided by a 1,290+ document policy handbook.

Why It Matters

The results are a reality check for healthcare automation: current AI agents cannot yet reliably handle policy-dense, multi-role workflows, which is likely to slow end-to-end automation of these processes.