AI Safety

METR Time Horizon Benchmarks Show AI Acceleration with Breakpoint in Early 2024

LessWrong AI June 10, 2026

⚡ Frontier AI models' time horizons have lengthened dramatically, with a clear acceleration starting around March 2024.

Deep Dive

Vermillion's LessWrong post extends the METR (Model Evaluation & Threat Research) time horizon benchmarks for public frontier language models. These benchmarks measure the task duration (by human expert completion time) at which an AI agent succeeds with a given reliability; for example, the 50% time horizon is the task duration at which the agent succeeds half the time. Plotting the data on a log scale, Vermillion tested three models: log-linear, log-quadratic, and piecewise log-linear (two line segments). The piecewise model, with a breakpoint detected automatically in RStudio around March 2024 (50% threshold) and April 2024 (80% threshold), provided the best fit. The Akaike information criterion (AIC), which penalizes model complexity, also favored the piecewise model over the simpler alternatives, which reduces (though does not eliminate) the concern that its extra parameters are merely overfitting. This suggests AI capabilities are not just growing exponentially: the exponential growth rate itself appears to have stepped up in early 2024.
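To make the model comparison concrete, here is a minimal sketch of how an AIC comparison between the three model families could look. It is not Vermillion's code (the post used R, and the exact package is not stated here), and it does not use METR data: the series below is synthetic, with a kink deliberately placed at a hypothetical t = 5.25 years. The helper names (ols_rss, aic, hinge_rss) and all numeric constants are assumptions chosen for illustration.

```python
import math
import random

def ols_rss(X, y):
    """Ordinary least squares via the normal equations; returns the residual sum of squares."""
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda row: abs(A[row][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(X, y))

def aic(rss, n, k):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2k. Lower is better."""
    return n * math.log(rss / n) + 2 * k

# SYNTHETIC data (not METR results): y stands in for log2(time horizon),
# t for years since an arbitrary start date. The slope steps up at TRUE_BREAK.
random.seed(0)
TRUE_BREAK = 5.25
t = [i * 7.0 / 59 for i in range(60)]
y = [1.0 + 1.7 * ti + 1.3 * max(0.0, ti - TRUE_BREAK) + random.gauss(0, 0.15) for ti in t]
n = len(t)

rss_lin = ols_rss([[1.0, ti] for ti in t], y)
rss_quad = ols_rss([[1.0, ti, ti * ti] for ti in t], y)

def hinge_rss(c):
    """RSS of a continuous two-segment line whose slope changes at breakpoint c."""
    return ols_rss([[1.0, ti, max(0.0, ti - c)] for ti in t], y)

# Grid search for the breakpoint, keeping a few points on each side of it
best_c = min(t[5:-5], key=hinge_rss)
rss_pw = hinge_rss(best_c)

# Parameter counts include the noise variance; the piecewise model also pays for its breakpoint
aics = {
    "log-linear": aic(rss_lin, n, 3),
    "log-quadratic": aic(rss_quad, n, 4),
    "piecewise": aic(rss_pw, n, 5),
}
for name, value in sorted(aics.items(), key=lambda kv: kv[1]):
    print(f"{name:14s} AIC = {value:8.2f}")
print(f"estimated breakpoint: t = {best_c:.2f}")
```

Because the piecewise model is charged for two more parameters than the log-linear one, it only wins on AIC when the reduction in residual error outweighs that penalty. On real data with few points or a subtle slope change, the ranking can easily go the other way.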

Vermillion also speculates about a second acceleration of proportional magnitude in 2029, though no data supports this yet. A commenter (StanislavKrym) asked whether progress has slowed since o3, but Vermillion countered that the post-2024 data appears straight on the log-scale plot, with no sign of deceleration. The analysis implies that the time horizons of current frontier models are lengthening faster than the pre-2024 trend predicted, signaling an inflection point in AI progress. For professionals, this means timelines for autonomous AI agents capable of multi-hour or multi-day tasks may be shorter than previously assumed, with significant implications for labor automation and strategic planning.

Key Points

METR benchmarks measure the task duration at which AI agents succeed with a given reliability, plotted on a log scale.
A piecewise log-linear model with a breakpoint in March/April 2024 fits the data best, and AIC (lower is better) favors it over simpler models.
The analysis suggests AI capability growth is accelerating, with a speculative second jump in 2029 if the pattern repeats.

Why It Matters

AI capability growth may be accelerating faster than expected, impacting timelines for AGI development and automation.
