New paper reveals major flaws in AI agent evaluation benchmarks
A 1MB replay script beats top AI agents without even looking at the screen.
A new paper on arXiv (2605.08261) from researchers at Amazon (Pierluca D'Oro et al.) exposes fundamental flaws in how Computer Use Agents (CUAs) are evaluated. The authors demonstrate that a trivial 1MB replay script, which blindly executes a recorded sequence of actions without ever observing the screen, outperforms frontier AI models on popular static benchmarks. They prove that in deterministic environments this script's expected success rate equals the source agent's pass@k, so a high score on such a benchmark shows only that a recorded action sequence can be replayed, not that a model can perceive and interact with an interface. The paper traces these failures to two root causes: non-principled environment design (static, unsandboxed, or unreliably verified environments) and non-principled evaluation methodology (naive aggregation and misuse of pass@k for stateful UI interactions).
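The pass@k equivalence can be illustrated with a small simulation. This is a minimal sketch, not the paper's construction: it assumes the replay script is built from k recorded attempts of the source agent, keeps one successful action sequence per task when one exists, and that replaying it in a deterministic environment reproduces the success. The per-attempt success probability `p` and the function names are hypothetical.

```python
import random


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def replay_success_rate(p: float, k: int, n_tasks: int, seed: int = 0) -> float:
    """Simulate a blind replay script on n_tasks deterministic tasks.

    Assumption: the source agent succeeds on each attempt with probability p,
    independently, and the replay script stores a successful action sequence
    whenever any of the k recorded attempts succeeded.
    """
    rng = random.Random(seed)
    solved = 0
    for _ in range(n_tasks):
        # Record k attempts by the source agent on this task.
        attempts = [rng.random() < p for _ in range(k)]
        # Deterministic environment: replaying a recorded success succeeds
        # again, without the script ever observing the screen.
        if any(attempts):
            solved += 1
    return solved / n_tasks


if __name__ == "__main__":
    p, k = 0.3, 5
    print("pass@k:", round(pass_at_k(p, k), 4))
    print("replay:", round(replay_success_rate(p, k, n_tasks=20000), 4))
```

With `p = 0.3` and `k = 5`, pass@k is about 0.83, and the simulated replay rate converges to the same value even though the replay never reads the screen.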
To address these issues, the authors propose PRISM, a set of five design principles for CUA environments: privileged verification, realistic environments, integrity-checked configurations, sandboxed execution, and multifactorial variability. They instantiate PRISM in DigiWorld, a new benchmark comprising 15 realistic sandboxed mobile applications that can evaluate agents in over 3.2 million verified unique configurations. Additionally, they develop an aggregation framework combining Wilson score intervals with hierarchical bootstrap to produce confidence intervals that correctly account for the nested structure of CUA benchmarks. The paper demonstrates that these methods are not optional refinements but prerequisites for meaningful CUA research, calling into question many prior evaluation results.
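The two statistical ingredients named above can be sketched in a few lines. This summary does not say exactly how the paper combines them, so the example shows each on its own: a Wilson score interval for a single success proportion, and a percentile bootstrap that resamples at every level of a nested benchmark. The three-level layout (apps, tasks, trials), the `data` format, and the function names are assumptions made for illustration.

```python
import math
import random


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def hierarchical_bootstrap(data, n_boot: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for the mean success rate of nested data.

    data: {app_name: {task_name: [0/1 trial outcomes]}} (assumed layout).
    Each replicate resamples apps, then tasks within each app, then trials
    within each task, so variance at every level enters the interval.
    """
    rng = random.Random(seed)
    apps = list(data)
    stats = []
    for _ in range(n_boot):
        app_means = []
        for app in rng.choices(apps, k=len(apps)):
            tasks = list(data[app])
            task_means = []
            for task in rng.choices(tasks, k=len(tasks)):
                trials = data[app][task]
                sample = rng.choices(trials, k=len(trials))
                task_means.append(sum(sample) / len(sample))
            app_means.append(sum(task_means) / len(task_means))
        stats.append(sum(app_means) / len(app_means))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


if __name__ == "__main__":
    print(wilson_interval(8, 10))
    toy = {
        "app_a": {"t1": [1, 1, 0, 1], "t2": [0, 0, 1, 0]},
        "app_b": {"t1": [1, 1, 1, 1], "t2": [1, 0, 1, 1]},
    }
    print(hierarchical_bootstrap(toy))
```

Pooling all trials into a single proportion would treat them as independent draws. Resampling apps and tasks as well as trials keeps the correlation within an app or task from making the interval look narrower than it should be.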
- A 1MB replay script that never observes the screen outperforms frontier AI models on static benchmarks, showing that these benchmarks can be passed without perceiving the interface at all.
- The authors define five PRISM design principles (privileged verification, realistic environments, integrity-checked configs, sandboxed execution, multifactorial variability) for proper CUA evaluation.
- DigiWorld, the new benchmark, includes 15 sandboxed mobile apps with over 3.2 million verified unique configurations; results are aggregated with Wilson score intervals and hierarchical bootstrap so that confidence intervals reflect the benchmark's nested structure.
Why It Matters
This paper calls into question many prior results on static CUA benchmarks and argues that the field needs a methodological overhaul in both environment design and statistical reporting.