[D] METR TH1.1: “working_time” is wildly different across models. Quick breakdown + questions.
New METR benchmark data shows a ~26x difference in total working time between two frontier models at comparable performance levels.
Deep Dive
METR's Time Horizon 1.1 benchmark data shows a massive efficiency gap. GPT-5.2 needed 142.4 hours of total "working_time" to achieve a p50 horizon of 394 minutes, while Claude Opus 4.5 reached a 320-minute horizon in just 5.5 hours, roughly 26x less working time (142.4 / 5.5 ≈ 25.9). On the surface this suggests Claude is far more computationally efficient per unit of performance, though differences in evaluation "scaffolds" may confound a direct comparison. The finding has sparked debate about how to properly measure AI efficiency.
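To make the comparison concrete, here's a minimal sketch that computes an efficiency ratio from the numbers above. The "horizon-minutes per working-hour" metric is my own shorthand for illustration, not something METR reports, and it assumes working_time is comparable across scaffolds, which it may not be.

```python
# Rough efficiency comparison from the reported METR TH1.1 numbers.
# "working_hours" = total hours the model spent on eval tasks;
# "p50_horizon_min" = task length (minutes) it completes 50% of the time.

models = {
    "GPT-5.2":         {"working_hours": 142.4, "p50_horizon_min": 394},
    "Claude Opus 4.5": {"working_hours": 5.5,   "p50_horizon_min": 320},
}

for name, m in models.items():
    # Hypothetical metric: horizon minutes achieved per hour of working time.
    efficiency = m["p50_horizon_min"] / m["working_hours"]
    print(f"{name}: {efficiency:.1f} horizon-min per working-hour")

# Ratio of total working times, i.e. the headline ~26x figure.
ratio = models["GPT-5.2"]["working_hours"] / models["Claude Opus 4.5"]["working_hours"]
print(f"Working-time ratio: {ratio:.1f}x")  # ~25.9x
```

Running this gives ~2.8 horizon-min/hour for GPT-5.2 versus ~58.2 for Claude Opus 4.5, which is where the "far more efficient per unit of performance" framing comes from.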
Why It Matters
This points to a hidden cost behind raw capability numbers: if a higher horizon comes bundled with an order of magnitude more working time, developers have to weigh speed and cost against capability.