We're running out of benchmarks to upper-bound AI capabilities
Frontier models like Claude Opus 4.6 are saturating METR's Time Horizon suite, making capability upper bounds unreliable.
A new report from METR, a leading AI safety research organization, warns that the industry is running out of reliable benchmarks to measure and upper-bound the capabilities of frontier AI models. Models like Anthropic's Claude Opus 4.6 and OpenAI's GPT-5.3 are now saturating sophisticated evaluation suites that were challenging just months ago. For instance, Claude Opus 4.6 succeeds at over 80% of tasks in METR's Time Horizon suite, a collection of long, complex tasks designed to measure how long an AI can work autonomously. This has pushed the model's estimated 'time horizon', a key safety metric, to a 95% upper confidence bound of 60 hours, making precise capability assessment difficult.
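To make the metric concrete: a time horizon is typically estimated by fitting a logistic curve of success probability against log task length, and reading off the length at which predicted success crosses 50%. The sketch below illustrates that idea with entirely synthetic, illustrative data (the task lengths, outcomes, and fitted numbers are assumptions for demonstration, not METR's actual results or code):

```python
import math

# Synthetic (illustrative) results: (task_length_minutes, success 0/1).
# Short tasks mostly succeed; long tasks mostly fail.
results = [
    (1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (30, 1),
    (60, 1), (120, 0), (240, 1), (480, 0), (960, 0), (1920, 0),
]

def fit_logistic(data, lr=0.1, steps=5000):
    """Fit p(success) = sigmoid(a + b * log2(minutes)) by gradient ascent
    on the mean log-likelihood."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for minutes, y in data:
            x = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / len(data)
        b += lr * grad_b / len(data)
    return a, b

a, b = fit_logistic(results)
# The 50% success point solves a + b * log2(t) = 0, i.e. t = 2 ** (-a / b).
horizon_minutes = 2.0 ** (-a / b)
print(f"estimated time horizon: {horizon_minutes:.0f} minutes")
```

A confidence bound like the 60-hour figure in the article would come from quantifying uncertainty around this fit (e.g. by bootstrapping over tasks), which is where sparse data at long task lengths makes the upper bound wide.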
The rapid saturation extends beyond METR's work. Academic benchmarks are also being maxed out, requiring constant, expensive updates. The situation is illustrated by Anthropic's own safety evaluations for Claude Opus 4.6: while those tests could rule out dangerous 'ASL-4' capabilities in previous models, Opus 4.6 exceeded them. Its final safety rating came not from a benchmark, but from an internal survey of 16 Anthropic researchers. This benchmark crisis is forcing a methodological shift across the field. Researchers are now exploring alternative approaches, including large-scale surveys, analysis of observational data on real-world AI use, and the creation of even more complex and expensive agentic benchmarks like τ2-Bench and Finance Agent to keep pace with AI progress.
- Claude Opus 4.6 achieves over 80% success on METR's Time Horizon suite, with a 95% upper-bound time horizon of 60 hours.
- Key safety evaluations like Anthropic's ASL-4 assessment are being maxed out, forcing reliance on researcher surveys instead of automated tests.
- The rapid saturation of benchmarks like GPQA and Time Horizon is driving a shift to new evaluation methods like observational data and surveys.
Why It Matters
Without reliable benchmarks, tracking AI progress and assessing safety risks becomes significantly harder for developers and policymakers.