Research & Papers

AgentAtlas exposes LLM agent evaluation flaws with accuracy drops of up to 40 points

Removing the explicit label menu from prompts cut trajectory accuracy by 14 to 40 percentage points for every model tested.

Deep Dive

AgentAtlas, created by researchers Parsa Mazaheri and Kasra Mazaheri, moves beyond simplistic accuracy leaderboards for evaluating LLM agents that operate on codebases, browsers, operating systems, and tools. The framework introduces four components: a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover), a nine-category trajectory-failure taxonomy with two hierarchical labels (primary error source and impact), a methodology comparing taxonomy-aware vs. taxonomy-blind prompting, and a benchmark-coverage audit against six behavioral axes. A demonstration on 1,342 generated items from eight models (four frontier closed-source and four open-weight) under both prompt modes showed striking results: removing the explicit label menu dropped every model's trajectory accuracy by 14 to 40 percentage points, compressing all scores to a narrow 0.54–0.62 floor regardless of model family. No single model won on all three of control accuracy, trajectory diagnosis, and tool-context utility retention, highlighting the inadequacy of single-number benchmarks for real-world agent deployment.
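To make the headline comparison concrete, here is a minimal Python sketch of how a taxonomy-aware vs. taxonomy-blind accuracy gap could be computed in percentage points. It is not the authors' code. The names `ControlDecision`, `accuracy`, and `drop_in_points` are hypothetical, and the labels are toy values invented for illustration. The summary does not list the nine trajectory-failure categories, so the six control states stand in as example labels.

```python
from enum import Enum


class ControlDecision(Enum):
    """The six control states named in the summary (class name is hypothetical)."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


def accuracy(predicted, gold):
    """Fraction of items where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one item")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def drop_in_points(aware_acc, blind_acc):
    """Accuracy lost when the label menu is removed, in percentage points."""
    return round((aware_acc - blind_acc) * 100, 1)


# Toy labels invented for illustration; these are NOT results from the paper.
C = ControlDecision
gold = [C.ACT, C.ASK, C.REFUSE, C.STOP, C.CONFIRM]
aware_predictions = [C.ACT, C.ASK, C.REFUSE, C.STOP, C.RECOVER]  # menu shown
blind_predictions = [C.ACT, C.ACT, C.REFUSE, C.ACT, C.RECOVER]   # menu removed

aware_acc = accuracy(aware_predictions, gold)
blind_acc = accuracy(blind_predictions, gold)
drop = drop_in_points(aware_acc, blind_acc)
print(f"aware={aware_acc:.2f} blind={blind_acc:.2f} drop={drop} pp")
```

Note that the gap is an absolute difference in percentage points, not a relative percentage: going from 0.80 to 0.40 is a 40-point drop but a 50% relative decline.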

Key Points
  • AgentAtlas introduces a six-state decision taxonomy (Act/Ask/Refuse/Stop/Confirm/Recover) and a nine-category failure taxonomy with hierarchical labels.
  • Removing the explicit label menu dropped trajectory accuracy by 14–40 pp for all eight tested models, compressing scores to a 0.54–0.62 floor.
  • No single model led on all three metrics (control accuracy, trajectory diagnosis, and tool-context utility retention), so a single-number leaderboard cannot capture agent capability.
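The "no single winner" point can be checked mechanically. The sketch below assumes per-model scores on the three metrics are stored in a nested dictionary. The function `overall_winner`, the model names, and every number are hypothetical placeholders, not figures from the paper.

```python
def overall_winner(scores):
    """Return the model that tops every metric, or None if no such model exists.

    `scores` maps model name -> {metric name -> score}. Ties on a metric are
    resolved in favor of the first model listed, which is enough for a sketch.
    """
    metrics = next(iter(scores.values())).keys()
    per_metric_best = {
        max(scores, key=lambda model: scores[model][metric]) for metric in metrics
    }
    return per_metric_best.pop() if len(per_metric_best) == 1 else None


# Invented scores for illustration only; NOT results from the paper.
toy_scores = {
    "model_a": {"control_accuracy": 0.90, "trajectory_diagnosis": 0.60, "utility_retention": 0.70},
    "model_b": {"control_accuracy": 0.80, "trajectory_diagnosis": 0.75, "utility_retention": 0.65},
    "model_c": {"control_accuracy": 0.70, "trajectory_diagnosis": 0.65, "utility_retention": 0.85},
}

print(overall_winner(toy_scores))  # each toy model leads a different metric
```

When each metric has a different leader, as in this toy data, any ranking by one number has to hide a trade-off on the other two.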

Why It Matters

AgentAtlas pushes AI teams to rethink evaluation, replacing single accuracy scores with multi-dimensional testing aimed at safer, more reliable agents.