Models & Releases

GPT-5.5 tops coding at 88.7% as model price gaps hit 30x

Top reasoning models sit within 0.5 percentage points of each other, but flagships cost about 30x more than the strongest open-weight option.

Deep Dive

OpenAI's GPT-5.5 leads coding benchmarks at 88.7% on SWE-bench Verified, beating Claude Opus 4.7 (87.6%) by just over one point. Claude Opus 4.7's new 2026 tokenizer, however, produces roughly 35% more tokens for the same English text, so the bill rises even though per-token sticker pricing is flat. DeepSeek V4 Pro Max delivers an 80.6% coding score at $0.43/$0.87 per million input/output tokens, making it the strongest open-weight option.
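To see how a tokenizer change raises the bill at flat prices, here is a minimal sketch of the arithmetic. The function name `bill_usd`, the workload size, and the flagship sticker prices are hypothetical placeholders, not figures from any vendor. Only the ~35% inflation and the $0.43/$0.87 DeepSeek prices come from the text above, and the sketch assumes the inflation applies equally to input and output tokens.

```python
def bill_usd(input_tokens, output_tokens, in_price_per_m, out_price_per_m,
             token_inflation=0.0):
    """Return the cost in USD for a workload priced per million tokens.

    token_inflation=0.35 models a tokenizer that emits 35% more tokens
    for the same text (assumed to affect input and output alike).
    """
    scale = 1.0 + token_inflation
    cost = (input_tokens * scale * in_price_per_m
            + output_tokens * scale * out_price_per_m)
    return cost / 1_000_000


# Hypothetical workload, measured in tokens under the old tokenizer.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 500_000

# Placeholder sticker prices (USD per million tokens); NOT real quotes.
STICKER_IN, STICKER_OUT = 10.0, 30.0

before = bill_usd(INPUT_TOKENS, OUTPUT_TOKENS, STICKER_IN, STICKER_OUT)
after = bill_usd(INPUT_TOKENS, OUTPUT_TOKENS, STICKER_IN, STICKER_OUT,
                 token_inflation=0.35)

# Same workload at the DeepSeek V4 Pro Max prices quoted in the article.
deepseek = bill_usd(INPUT_TOKENS, OUTPUT_TOKENS, 0.43, 0.87)

print(f"flat sticker price, old tokenizer: ${before:.2f}")
print(f"flat sticker price, new tokenizer: ${after:.2f}")
print(f"bill increase: {after / before - 1:.0%}")
print(f"DeepSeek V4 Pro Max: ${deepseek:.3f}")
```

Because the inflation multiplies every token count, the bill grows by the same ~35% whatever sticker prices are plugged in, which is why a flat price list can still mean a higher invoice.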

In reasoning, Gemini 3.1 Pro (94.3% on GPQA Diamond) and Claude Opus 4.7 (94.2%) are statistically tied, and the top four models span just 0.5 percentage points. GPT-5.5 reaches 93.5% with extended thinking effort, a sign that effort-budget settings can now matter more than the model name. The key takeaway: no single model dominates both coding and reasoning, and the 30x price gap between flagships and cost-effective models means professionals must match model to task carefully.

Key Points
  • GPT-5.5 leads coding at 88.7% on SWE-bench Verified, but Claude Opus 4.7 (87.6%) and GPT-5.3-Codex (85.0%) are close.
  • Gemini 3.1 Pro tops reasoning at 94.3% on GPQA Diamond; the top four models are within 0.5 percentage points, a statistical tie.
  • DeepSeek V4 Pro Max costs $0.43/$0.87 per million input/output tokens (about one-thirtieth of flagship pricing) with an 80.6% coding score.
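
The trade-off in these points can be laid out as simple arithmetic. The sketch below uses only the coding scores and the 30x price multiple quoted above. The names `CODING_SCORES` and `PRICE_MULTIPLE` are illustrative, and the comparison assumes the four scores are measured on a comparable basis.

```python
# Coding scores (percent) as quoted in the Key Points.
CODING_SCORES = {
    "GPT-5.5": 88.7,
    "Claude Opus 4.7": 87.6,
    "GPT-5.3-Codex": 85.0,
    "DeepSeek V4 Pro Max": 80.6,
}

# The article's flagship-to-open-weight price multiple.
PRICE_MULTIPLE = 30

# Find the top scorer, then each model's gap to it in percentage points.
leader = max(CODING_SCORES, key=CODING_SCORES.get)
gaps = {
    name: round(CODING_SCORES[leader] - score, 1)
    for name, score in CODING_SCORES.items()
}

for name, gap in gaps.items():
    print(f"{name}: {gap:.1f} points behind {leader}")

open_weight_gap = gaps["DeepSeek V4 Pro Max"]
print(f"A {PRICE_MULTIPLE}x price multiple buys {open_weight_gap:.1f} points")
```

Whether that gap justifies the price multiple depends on the task, which is the selection problem the article describes.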

Why It Matters

With near-parity performance now available across a 30x range in cost, professionals need to select a model precisely for each task.