I'm confused by the change in the METR trend
New analysis points to a likely permanent shift in AI capability scaling, with task completion times improving faster for models released after early 2024.
A new analysis of the METR (Measuring AI Ability to Complete Long Tasks) benchmark has identified a significant and likely permanent acceleration in AI capability scaling. Posted by 'Expertium' on the LessWrong forum, the deep dive uses model-selection criteria such as the Bayesian Information Criterion (BIC) to show that a piecewise linear function—with a breakpoint around February/March 2024—fits the benchmark data far better than a single straight line. This indicates that the 'task horizon doubling time' (the rate at which AI models improve at completing long, complex tasks) fundamentally shortened for models released in 2024 and later, ruling out a random fluke or methodological artifact.
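The model comparison described above can be sketched in a few lines of numpy. The data below is synthetic and illustrative only (the slopes, breakpoint, and noise level are assumptions, not the real METR figures): `y` stands in for log2 of the task horizon, so a line's slope is doublings per year and the doubling time is its reciprocal. A lower BIC favors the piecewise model despite its extra parameter.

```python
import numpy as np

def bic(y, y_pred, k):
    # BIC under Gaussian residuals: n*ln(RSS/n) + k*ln(n); lower is better.
    n = len(y)
    rss = np.sum((y - y_pred) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

# Illustrative synthetic data (NOT the real METR numbers):
# x = years since 2019, y = log2(task horizon).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.5, 40)      # 2019.0 .. mid-2025
break_x = 5.15                     # ~Feb/Mar 2024
slope1, slope2 = 12 / 7, 12 / 4    # assumed doublings/year before and after
y = np.where(x < break_x,
             slope1 * x,
             slope1 * break_x + slope2 * (x - break_x))
y += rng.normal(0.0, 0.3, x.shape)

# Model 1: single line (k = 3: slope, intercept, noise variance).
y_lin = np.polyval(np.polyfit(x, y, 1), x)

# Model 2: continuous piecewise line with a fixed breakpoint
# (k = 4: intercept, base slope, slope change, noise variance).
X = np.column_stack([np.ones_like(x), x, np.maximum(x - break_x, 0.0)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pw = X @ beta

print(f"BIC linear:    {bic(y, y_lin, 3):.1f}")
print(f"BIC piecewise: {bic(y, y_pw, 4):.1f}")
print(f"pre-break doubling time:  {12 / beta[1]:.1f} months")
print(f"post-break doubling time: {12 / (beta[1] + beta[2]):.1f} months")
```

In a fuller treatment the breakpoint itself would be a free parameter (costing one more degree of freedom in the BIC penalty), but the fixed-breakpoint version shown here is enough to illustrate why the kinked fit wins.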
The investigation rules out Reinforcement Learning from Human Feedback (RLHF) as the primary cause, since it predates the shift. Native chain-of-thought (CoT) reasoning, popularized by models like OpenAI's o1-preview, is a strong contender, but the timing is imperfect—the trend changed months before o1's release, and some non-CoT models sit on the faster trend. Commenters suggest alternatives such as Reinforcement Learning from Verifiable Rewards (RLVR). The core mystery is that the acceleration appears to be a sustained, underlying change in the rate of AI progress rather than a one-time boost, yet its exact origin remains opaque—a reminder that key advances inside AI labs can stay hidden from public view.
- Statistical analysis (BIC) confirms a permanent acceleration in the METR benchmark's 'task horizon' trend starting Feb/Mar 2024.
- The change means models released after early 2024 improve at completing long tasks at a fundamentally faster rate than earlier models.
- Chain-of-Thought reasoning is a leading but imperfect explanation; the true catalyst for the accelerated scaling remains unidentified.
Why It Matters
This signals a hidden phase change in AI development: if the faster trend holds, future models will surpass expectations sooner than pre-2024 extrapolations suggest, forcing revisions to long-horizon forecasts and planning.