GPT-5.3 Codex (high) scores underwhelming results on METR
OpenAI's latest coding model falls short of expectations, scoring just 72% on the rigorous METR evaluation.
OpenAI's latest iteration of its code-specialized model, GPT-5.3 Codex, has generated discussion after reportedly scoring an underwhelming 72% on the METR benchmark. METR's evaluations are designed to rigorously measure how well models handle complex, multi-step software tasks. The score, shared via a Reddit post, suggests the model may not represent the significant leap forward many anticipated from the '5.3' versioning. While details are limited, the result highlights the ongoing challenge of scaling model performance predictably, especially in domains like code generation that demand precise logic and planning. It serves as a reminder that version increments do not always guarantee proportional performance gains.
- GPT-5.3 Codex scored approximately 72% on the comprehensive METR evaluation benchmark.
- The result is considered underwhelming for a model with a '5.3' version increment from OpenAI.
- Performance suggests potential limitations in complex, multi-step reasoning crucial for advanced coding tasks.
Why It Matters
For developers, the result signals that even the newest AI coding assistants may not yet reliably handle advanced, real-world programming challenges, so human review of generated code remains essential.