r/Singularity May 13, 2026

⚡ A new SWE benchmark sees its first-ever solution, and it comes from GPT-5.5 high/xhigh.

Deep Dive

A Reddit post by user socoolandawesome shares links to tweets, a GitHub repository, and the ProgramBench website.

Key Points

GPT-5.5 high/xhigh is the first model to solve any task on ProgramBench, a new SWE benchmark.
It significantly outperforms Anthropic's Opus 4.7, which held the previous top score.
The task required end-to-end reasoning across multiple files, not just isolated code generation.

The first AI solution to a task on a complex SWE benchmark hints that autonomous coding agents are becoming viable for real-world projects.