DecisionBench: New benchmark reveals massive headroom for AI agent delegation
23,375 task instances, 11 models, and a gap of up to 31 percentage points to perfect delegation.
An academic research team has released DecisionBench, a standardized benchmark substrate for evaluating how well AI agents delegate subtasks to other models in long-horizon workflows. The benchmark fixes five components: a task suite covering GAIA, tau-bench, and BFCL multi-turn; a peer-model pool of 11 models from 7 vendor families (including GPT-4o, Claude 3.5, Llama 3, Gemini, Mistral, Command R+, and Qwen2); a delegation interface built on call_model plus an optional read_profile channel; a deterministic skill-annotation layer; and a multi-axis metric suite that tracks quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be tested against it. A reference sweep across five experimental conditions on 23,375 task instances produced three benchmark-level findings.
First, mean end-task quality was statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), meaning quality-only evaluations would miss the orchestration signal entirely. Second, routing fidelity-at-1 (the ability to pick the best-suited peer on the first try) ranged from 7.5% to 29.5% across conditions at near-equal mean quality, and the delivery channel (on-demand tool vs. preloaded description) mattered more than the content of the description. Third, a counterfactual ceiling places perfect delegation 15 to 31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. The authors release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives. For AI engineers building multi-agent systems, the benchmark offers a rigorous way to measure and improve delegation algorithms, a critical capability for scaling autonomous workflows.
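To make the second and third findings concrete, here is a minimal Python sketch of how routing fidelity-at-k and a counterfactual-delegation ceiling could be computed. It is not the authors' analysis pipeline. It assumes that each delegated subtask comes with a per-peer quality score, that fidelity-at-k counts a delegation as a hit when the chosen peer is among the k highest-scoring peers, and that the ceiling is the mean quality obtained by always routing to the best peer. The function names, peer names, and scores are all hypothetical.

```python
from typing import Dict, List

# Hypothetical per-subtask quality scores for each peer (illustrative only).
PeerScores = Dict[str, float]


def fidelity_at_k(chosen: List[str], scores: List[PeerScores], k: int = 1) -> float:
    """Fraction of delegations whose chosen peer is among the k best-scoring peers."""
    hits = 0
    for peer, task_scores in zip(chosen, scores):
        top_k = sorted(task_scores, key=task_scores.get, reverse=True)[:k]
        hits += peer in top_k
    return hits / len(chosen)


def measured_quality(chosen: List[str], scores: List[PeerScores]) -> float:
    """Mean quality actually obtained with the agent's routing choices."""
    return sum(s[p] for p, s in zip(chosen, scores)) / len(chosen)


def counterfactual_ceiling(scores: List[PeerScores]) -> float:
    """Mean quality if every subtask had been routed to its best peer."""
    return sum(max(s.values()) for s in scores) / len(scores)


# Toy data: four subtasks, three hypothetical peers.
scores = [
    {"peer_a": 0.9, "peer_b": 0.5, "peer_c": 0.1},
    {"peer_a": 0.2, "peer_b": 0.8, "peer_c": 0.6},
    {"peer_a": 0.3, "peer_b": 0.4, "peer_c": 1.0},
    {"peer_a": 0.7, "peer_b": 0.1, "peer_c": 0.5},
]
chosen = ["peer_a", "peer_c", "peer_b", "peer_c"]

f1 = fidelity_at_k(chosen, scores, k=1)
f2 = fidelity_at_k(chosen, scores, k=2)
headroom = counterfactual_ceiling(scores) - measured_quality(chosen, scores)
print(f"fidelity@1={f1:.2f} fidelity@2={f2:.2f} headroom={headroom:.2f}")
```

In this toy example the agent picks the best peer on only one of four subtasks, yet its mean quality stays moderate because its second-best choices still score reasonably. That is the kind of pattern a quality-only evaluation would hide and a fidelity or ceiling metric would expose.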
- DecisionBench evaluates AI agents' ability to delegate subtasks to other models in long-horizon workflows across 23,375 task instances.
- Tested 11 models from 7 vendors; routing fidelity-at-1 ranged from 7.5% to 29.5% across different awareness conditions.
- The perfect-delegation ceiling sits 15–31 percentage points above measured performance on every suite, indicating major headroom for orchestration improvements.
Why It Matters
Shows that current AI agents fall 15 to 31 percentage points short of perfect delegation, and that quality-only evaluation hides this gap, giving future orchestration research a concrete target.