[R] On Randomness in Agentic Evals
The AI community's favorite benchmarks might be lying to you.
Deep Dive
A new paper identifies a critical flaw in how AI models are evaluated: single-run benchmark scores are far noisier than commonly assumed. SWE-Bench-Verified scores can vary by 2.2 to 6.0 percentage points across runs, making small claimed improvements statistically indistinguishable from random noise. That undermines decisions about which models to deploy, which research to fund, and which tools to ship, since those decisions may rest on unreliable evidence.
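To make the noise problem concrete, here is a minimal Python sketch (using made-up resolve rates, not figures from the paper) that estimates run-to-run variability from repeated identical runs and checks whether a claimed improvement clears that noise band:

```python
import statistics

def score_noise(run_scores):
    """Mean and sample standard deviation of a benchmark score across repeated runs."""
    return statistics.mean(run_scores), statistics.stdev(run_scores)

# Hypothetical SWE-Bench-Verified resolve rates (%) from five identical runs per model.
model_a = [38.4, 40.1, 36.9, 39.5, 37.8]
model_b = [40.2, 39.0, 41.3, 38.7, 40.6]

mean_a, sd_a = score_noise(model_a)
mean_b, sd_b = score_noise(model_b)

gap = mean_b - mean_a
# Rough check: if the gap falls within ~2 combined standard errors,
# it is indistinguishable from run-to-run noise at this sample size.
combined_se = (sd_a**2 / len(model_a) + sd_b**2 / len(model_b)) ** 0.5
print(f"gap = {gap:.2f} pts, noise band ~ +/-{2 * combined_se:.2f} pts")
```

With single runs there is no standard deviation to compute at all, which is exactly why a one-point gap between two leaderboard entries tells you very little.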
Why It Matters
Billions of dollars in funding and deployment decisions may rest on noisy, misleading benchmark results.