GPT-5.5 high/xhigh becomes first to solve task on ProgramBench, beating Opus 4.7
A new SWE benchmark sees its first-ever solved task, and the solution comes from GPT-5.5 high/xhigh.
Deep Dive
A Reddit post by user socoolandawesome shares links to tweets, a GitHub repository, and the ProgramBench website.
Key Points
- GPT-5.5 high/xhigh is the first model to solve any task on ProgramBench, a new SWE benchmark.
- It significantly outperforms Anthropic's Opus 4.7, which held the previous top score.
- The task required end-to-end reasoning across multiple files, not just isolated code generation.
Why It Matters
The first AI solution to a task on a complex SWE benchmark hints that autonomous coding agents are becoming viable for real-world projects.