Estimating METR Time Horizons for Claude Opus 4.6 and GPT 5.3 Codex (xhigh)
New crowd-sourced estimates point to a narrow leader on a key agentic benchmark in the AI race...
Deep Dive
A new analysis estimates the METR time horizons for Claude Opus 4.6 and the rumored GPT 5.3 Codex. METR's time horizon measures the length of tasks, in human-expert time, that a model can complete autonomously at roughly a 50% success rate. The crowd-sourced prediction expects GPT 5.3 Codex to lead with an 8.7-hour horizon versus Opus's 7.9 hours. The methodology extends the Epoch Capabilities Index, using agentic benchmarks like SWE-Bench Pro to model these performance metrics.
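METR's headline number is the 50% time horizon: fit a logistic curve to a model's success/failure outcomes against log task length, then read off the task length at which success probability crosses 50%. A minimal sketch of that idea, with made-up outcomes and a hand-rolled gradient-descent fit (the data, learning rate, and iteration count are illustrative, not METR's or the analysis's actual pipeline):

```python
import math

# Hypothetical (task_minutes, success) outcomes for one model on agentic
# benchmark tasks -- illustrative data, not real METR results.
tasks = [
    (2, 1), (5, 1), (15, 1), (30, 1), (60, 1),
    (120, 1), (240, 0), (480, 1), (960, 0), (1920, 0),
]

# Work in log2(minutes), centered for numerical stability.
xs = [math.log2(t) for t, _ in tasks]
ys = [s for _, s in tasks]
xm = sum(xs) / len(xs)
xs = [x - xm for x in xs]

# Plain gradient descent on the logistic log-loss.
a, b = 0.0, 0.0  # slope, intercept
lr = 0.1
for _ in range(20000):
    ga = gb = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(a * x + b)))
        ga += (p - y) * x
        gb += (p - y)
    a -= lr * ga / len(xs)
    b -= lr * gb / len(xs)

# p = 0.5 where a*x + b = 0, i.e. x = -b/a; undo centering and the log2.
horizon_minutes = 2 ** (xm - b / a)
print(f"50% time horizon: {horizon_minutes / 60:.1f} hours")
```

The 7.9- and 8.7-hour figures in the analysis are the output of this kind of curve-fit, extrapolated from benchmark scores rather than measured directly.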
Why It Matters
These estimates offer a leading indicator of which model will dominate complex, real-world software-engineering and reasoning tasks for developers.