[R] On Randomness in Agentic Evals
The AI community's favorite benchmarks might be lying to you.
Deep Dive
A new paper identifies a critical flaw in how AI models are evaluated: single-run benchmark scores are far noisier than commonly assumed. SWE-Bench-Verified scores can vary by 2.2 to 6.0 percentage points across runs, making small claimed improvements statistically indistinguishable from random noise. That undermines decisions about which models to deploy, which research to fund, and which tools to ship, since those decisions may rest on unreliable evidence.
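To make the noise problem concrete, here is a minimal Python sketch (using made-up resolve rates, not figures from the paper) that estimates run-to-run variability from repeated identical runs and checks whether a claimed improvement clears that noise band:

```python
import statistics

def score_noise(run_scores):
    """Mean and sample standard deviation of a benchmark score across repeated runs."""
    return statistics.mean(run_scores), statistics.stdev(run_scores)

# Hypothetical SWE-Bench-Verified resolve rates (%) from five identical runs per model.
model_a = [38.4, 40.1, 36.9, 39.5, 37.8]
model_b = [40.2, 39.0, 41.3, 38.7, 40.6]

mean_a, sd_a = score_noise(model_a)
mean_b, sd_b = score_noise(model_b)

gap = mean_b - mean_a
# Rough check: if the gap falls within ~2 combined standard errors,
# it is indistinguishable from run-to-run noise at this sample size.
combined_se = (sd_a**2 / len(model_a) + sd_b**2 / len(model_b)) ** 0.5
print(f"gap = {gap:.2f} pts, noise band ~ +/-{2 * combined_se:.2f} pts")
```

With single runs there is no standard deviation to compute at all, which is exactly why a one-point gap between two leaderboard entries tells you very little.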
Why It Matters
Billions of dollars in funding and deployment decisions may rest on noisy, misleading benchmark results.