[D] METR TH1.1: “working_time” is wildly different across models. Quick breakdown + questions.
New METR benchmark data shows a ~26x difference in total working time between two frontier models at comparable performance levels.
Deep Dive
METR's Time Horizon 1.1 benchmark data shows a massive efficiency gap. GPT-5.2 needed 142.4 hours of total "working_time" to achieve a p50 horizon of 394 minutes, while Claude Opus 4.5 reached a 320-minute horizon in just 5.5 hours, roughly 26x less working time (142.4 / 5.5 ≈ 25.9). On the surface this suggests Claude is far more computationally efficient per unit of performance, though differences in evaluation "scaffolds" may confound a direct comparison. The finding has sparked debate about how to properly measure AI efficiency.
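To make the comparison concrete, here's a minimal sketch that computes an efficiency ratio from the numbers above. The "horizon-minutes per working-hour" metric is my own shorthand for illustration, not something METR reports, and it assumes working_time is comparable across scaffolds, which it may not be.

```python
# Rough efficiency comparison from the reported METR TH1.1 numbers.
# "working_hours" = total hours the model spent on eval tasks;
# "p50_horizon_min" = task length (minutes) it completes 50% of the time.

models = {
    "GPT-5.2":         {"working_hours": 142.4, "p50_horizon_min": 394},
    "Claude Opus 4.5": {"working_hours": 5.5,   "p50_horizon_min": 320},
}

for name, m in models.items():
    # Hypothetical metric: horizon minutes achieved per hour of working time.
    efficiency = m["p50_horizon_min"] / m["working_hours"]
    print(f"{name}: {efficiency:.1f} horizon-min per working-hour")

# Ratio of total working times, i.e. the headline ~26x figure.
ratio = models["GPT-5.2"]["working_hours"] / models["Claude Opus 4.5"]["working_hours"]
print(f"Working-time ratio: {ratio:.1f}x")  # ~25.9x
```

Running this gives ~2.8 horizon-min/hour for GPT-5.2 versus ~58.2 for Claude Opus 4.5, which is where the "far more efficient per unit of performance" framing comes from.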
Why It Matters
This points to a hidden cost behind raw capability numbers: if a higher horizon comes bundled with an order of magnitude more working time, developers have to weigh speed and cost against capability.