AI Safety

GPT 5.3 Codex Reportedly Beats Claude Opus 4.6 in Key AI Benchmark

The latest leaked benchmark scores reveal a surprising winner in the AI race...

Deep Dive

A new analysis estimates METR time horizons for Claude Opus 4.6 and the rumored GPT 5.3 Codex. The time horizon is a key benchmark: it measures how long a task, in human-expert working time, an AI can complete in one shot. The crowd-sourced prediction expects GPT 5.3 Codex to lead with an 8.7-hour horizon versus 7.9 hours for Claude Opus 4.6. The methodology extends the Epoch Capabilities Index with agentic benchmarks such as SWE-Bench Pro to model these time horizons.
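To put the two predicted horizons in proportion, here is a minimal sketch that computes the absolute and relative gap. The figures are the ones quoted above; the variable and function names are illustrative and not part of the analysis itself.

```python
# Predicted METR time horizons in hours, as quoted in the analysis.
horizons_hours = {"GPT 5.3 Codex": 8.7, "Claude Opus 4.6": 7.9}


def relative_lead(leader: float, trailer: float) -> float:
    """Fractional lead of one horizon over the other (hypothetical helper)."""
    return (leader - trailer) / trailer


gap_hours = horizons_hours["GPT 5.3 Codex"] - horizons_hours["Claude Opus 4.6"]
lead = relative_lead(horizons_hours["GPT 5.3 Codex"], horizons_hours["Claude Opus 4.6"])

# Prints: Gap: 0.8 h (10% relative lead)
print(f"Gap: {gap_hours:.1f} h ({lead:.0%} relative lead)")
```

On these numbers the predicted lead is under an hour, or about 10 percent.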

Why It Matters

If the estimates hold, they offer an early signal of which model is likely to handle long, complex real-world software and reasoning tasks better for developers.
