Models & Releases

OpenAI's GPT 5.5 beats Anthropic's Opus 4.7 on new ProgramBench benchmark

GPT 5.5 solves first task with fewer agent steps, outperforming Opus 4.7 significantly

Deep Dive

The ProgramBench team recently released a new benchmark for evaluating AI coding agents, and the results are already turning heads. OpenAI's GPT 5.5 was left out of the initial model set for a NeurIPS submission because of timing, yet it has come out on top, significantly outperforming Anthropic's Claude Opus 4.7. In particular, GPT 5.5 was the first model to solve the benchmark's first task, and it posted much higher overall scores.

What's especially noteworthy is GPT 5.5's efficiency: it required far fewer agent steps than Opus 4.7. The model achieves this by chaining multiple commands with `&&`, so a single step does the work of several, which reduces token usage and speeds up execution. This suggests that GPT 5.5 is not just more capable but also better suited to real-world agentic workflows. The full analysis is available on the ProgramBench blog; the team expressed surprise at the results, given that GPT 5.5 was not even part of their original evaluation plan.
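To make the step-count argument concrete, here is a minimal sketch of why chaining with `&&` lowers the number of agent steps. It is a toy model, not ProgramBench's harness or GPT 5.5's actual behavior: the `Command` type, both functions, and the command names are hypothetical, and each "command" is just a Python callable that returns an exit code.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a shell command: a label plus a callable
# that returns an exit code (0 = success), as a real process would.
Command = Tuple[str, Callable[[], int]]


def run_one_per_step(commands: List[Command]) -> Tuple[int, List[str]]:
    """Issue each command as its own agent step, stopping at the first failure."""
    steps, ran = 0, []
    for name, action in commands:
        steps += 1  # one agent step per command
        ran.append(name)
        if action() != 0:
            break
    return steps, ran


def run_chained(commands: List[Command]) -> Tuple[int, List[str]]:
    """Issue all commands as a single step, like `a && b && c`.

    `&&` short-circuits: a command runs only if the previous one exited 0.
    """
    ran = []
    for name, action in commands:
        ran.append(name)
        if action() != 0:
            break
    return (1 if commands else 0), ran


# Illustrative command names only; none of these come from the benchmark.
workflow: List[Command] = [
    ("configure", lambda: 0),
    ("build", lambda: 0),
    ("test", lambda: 0),
]

print(run_one_per_step(workflow))  # (3, ['configure', 'build', 'test'])
print(run_chained(workflow))       # (1, ['configure', 'build', 'test'])
```

In this sketch the same three commands run either way, but the chained version counts as one step instead of three. The trade-off is that the agent sees output only after the whole chain finishes or stops at the first failure.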

Key Points
  • GPT 5.5 is the first model to solve ProgramBench's initial task
  • It significantly outperformed Claude Opus 4.7 in overall benchmark scores
  • GPT 5.5 uses fewer agent steps by bundling commands with `&&`, improving token efficiency

Why It Matters

GPT 5.5's combination of higher scores and fewer agent steps could shift the competitive landscape for AI coding agents, where token usage and execution speed matter alongside raw capability.