GPT-5.2 takes 142 hours vs Claude's 5.5 hours for similar benchmark scores
A new benchmark reveals a shocking 26x runtime difference between top AI models.
METR's Time Horizon 1.1 benchmark data shows a massive efficiency gap. GPT-5.2 required 142.4 hours of total 'working_time' to achieve a p50 horizon of 394 minutes. Claude Opus 4.5 achieved a 320-minute horizon in just 5.5 hours—26 times faster. This suggests Claude is far more computationally efficient per unit of performance, though differences in evaluation 'scaffolds' may confound direct comparisons. The finding sparks debate about how to properly measure AI efficiency.
Why It Matters
This exposes a potential hidden cost of raw performance, forcing developers to weigh speed against capability.