AI Safety

Estimating METR Time Horizons for Claude Opus 4.6 and GPT 5.3 Codex (xhigh)

A new crowd-sourced estimate projects which frontier model will lead on a key agentic benchmark...

Deep Dive

A new analysis estimates the METR time horizons for Claude Opus 4.6 and the rumored GPT 5.3 Codex. The METR time horizon measures the length of a task, in human-expert time, that an AI can complete on its own at a given reliability. The crowd-sourced prediction puts GPT 5.3 Codex ahead with an 8.7-hour horizon versus Opus 4.6's 7.9 hours. The methodology extends the Epoch Capabilities Index, using agentic benchmarks such as SWE-Bench Pro to model these horizons.
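To make the metric concrete: a time horizon is typically estimated by fitting a success-vs-task-length curve and reading off where success probability crosses 50%. The sketch below is a minimal illustration of that idea, not the analysis's actual code; the task durations and outcomes are invented, and the grid-search logistic fit stands in for a proper maximum-likelihood procedure.

```python
import math

# Hypothetical per-task results: (human-expert minutes to complete, AI succeeded?).
# Illustrative numbers only, not real benchmark data.
tasks = [
    (5, True), (15, True), (30, True), (60, True),
    (120, True), (240, True), (480, False), (960, False),
]

def horizon_50(tasks, lo=1.0, hi=10_000.0):
    """Estimate the task length (minutes) at which success probability
    crosses 50%, by fitting a logistic curve in log-duration with a
    crude grid search over (midpoint, slope)."""
    best_mid, best_ll = None, -math.inf
    midpoints = [lo * (hi / lo) ** (i / 200) for i in range(201)]
    for mid in midpoints:
        for slope in (0.5, 1.0, 2.0, 4.0):
            ll = 0.0
            for minutes, ok in tasks:
                # Success probability falls as tasks get longer than `mid`.
                p = 1.0 / (1.0 + math.exp(slope * (math.log(minutes) - math.log(mid))))
                p = min(max(p, 1e-9), 1 - 1e-9)  # avoid log(0)
                ll += math.log(p if ok else 1 - p)
            if ll > best_ll:
                best_ll, best_mid = ll, mid
    return best_mid

print(f"Estimated 50% time horizon: {horizon_50(tasks):.0f} minutes")
```

With the toy data above, the fitted horizon lands between the longest success (240 min) and the shortest failure (480 min), which is the intuition behind headline figures like "8.7 hours".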

Why It Matters

These estimates are a leading indicator of which model will dominate complex, real-world software and reasoning tasks for developers.