ABRA benchmark shows AI radiology agents fail at perception, not tools
Ten AI models reach at least 89% Execution but only 0–25% Outcome on real radiology annotation tasks.
Existing medical-agent benchmarks hand models pre-selected images rather than placing them in an interactive environment. ABRA, from Maksudov et al., changes that by placing agents inside a full radiology workstation (the OHIF viewer and an Orthanc DICOM server) with 21 API tools for slice navigation, windowing, series selection, pixel annotations, and structured reporting. It draws 655 tasks from datasets including LIDC-IDRI, Duke Breast Cancer MRI, and NLST LongCT, covering viewer controls, metadata QA, vision probes, annotations, longitudinal comparisons, and BI-RADS reporting. Each task is scored on Planning, Execution, and Outcome.
Ten current models (five closed-weight, five open-weight) were evaluated. On real annotation tasks, every model achieved at least 89% Execution (successful tool calls) but only 0–25% Outcome (correct end state). When an oracle variant of the same task supplied the finding through a simulated detector, Outcome rose to 69–100%. Because the task and tools stay the same and only the source of the finding changes, the gap points to perception, not tool orchestration, as the core weakness. The benchmark, task generators, and scorers are publicly released so the community can target that weakness in medical AI agents.
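To make the distinction between Execution and Outcome concrete, here is a minimal Python sketch. It is not ABRA's released scorer: the `TaskRun` record, its field names, and the per-call aggregation are assumptions made for illustration, and the toy numbers are invented rather than taken from the benchmark.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRun:
    """One agent attempt at one task (hypothetical record layout)."""
    tool_calls_ok: int      # tool calls that completed without error
    tool_calls_total: int   # all tool calls the agent issued
    outcome_correct: bool   # whether the final state matched the target


def execution_rate(runs: List[TaskRun]) -> float:
    """Fraction of tool calls that succeeded, pooled over all runs."""
    total = sum(r.tool_calls_total for r in runs)
    ok = sum(r.tool_calls_ok for r in runs)
    return ok / total if total else 0.0


def outcome_rate(runs: List[TaskRun]) -> float:
    """Fraction of runs whose end state was correct."""
    if not runs:
        return 0.0
    return sum(r.outcome_correct for r in runs) / len(runs)


# Invented toy data: nearly every call succeeds, yet most runs end wrong.
toy_runs = [
    TaskRun(tool_calls_ok=5, tool_calls_total=5, outcome_correct=True),
    TaskRun(tool_calls_ok=5, tool_calls_total=5, outcome_correct=False),
    TaskRun(tool_calls_ok=4, tool_calls_total=5, outcome_correct=False),
    TaskRun(tool_calls_ok=5, tool_calls_total=5, outcome_correct=False),
]

print(f"Execution: {execution_rate(toy_runs):.0%}")
print(f"Outcome:   {outcome_rate(toy_runs):.0%}")
```

An agent can issue well-formed tool calls all the way through a task and still annotate the wrong location, which is why the two scores can diverge so sharply.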
- ABRA evaluates agents in a realistic radiology environment (OHIF viewer + DICOM server) with 21 function-calling tools.
- 655 programmatically generated tasks across 3 difficulty tiers and 8 types, including annotation, BI-RADS reporting, and longitudinal comparison.
- Models reach at least 89% Execution but only 0–25% Outcome on real annotation tasks; with a simulated detector, Outcome reaches 69–100%, which points to perception as the bottleneck.
Why It Matters
ABRA shows that AI radiology agents fail at perception rather than tool orchestration, a key insight for clinical deployment.