Media & Culture

GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take second place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark

GPT-5.5 jumps to 97.5% on hard puzzles, beating Opus 4.6 by 2.8 points...

Deep Dive

OpenAI's latest GPT-5.5 model has made significant strides on the Extended NYT Connections Benchmark, a rigorous test of reasoning and pattern recognition. GPT-5.5 achieved 97.5% on high-difficulty puzzles (up from 94.0% for GPT-5.4) and 96.9% on medium difficulty (up from 93.6%). Even without explicit reasoning, it improved from 32.8% to 37.5%. This performance vaults GPT-5.5 past Anthropic's Opus 4.6 (which scored 94.7% on answered puzzles) into second place overall, behind only Google's Gemini 3.1 Pro. Notably, Opus 4.7 struggled, refusing to answer 54% of puzzles and scoring just 41.0% on high difficulty.
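The distinction between Opus 4.6's "94.7% on answered puzzles" and an overall score matters once refusals enter the picture, as with Opus 4.7's 54% refusal rate. A minimal sketch of that arithmetic, assuming the simple rule that a refused puzzle counts as a miss (an illustrative assumption, not the benchmark's documented methodology):

```python
# Hedged sketch: how a refusal rate turns an "answered-only" accuracy
# into a much lower overall accuracy, under the illustrative assumption
# that refused puzzles score zero.

def overall_score(answered_accuracy: float, refusal_rate: float) -> float:
    """Accuracy over all puzzles if refused puzzles count as misses."""
    return answered_accuracy * (1.0 - refusal_rate)

# Example: a model with 94.7% accuracy on the puzzles it answers,
# but a 54% refusal rate, lands far lower overall.
print(round(overall_score(0.947, 0.54), 3))  # → 0.436
```

Under this counting rule, a high accuracy on answered puzzles can coexist with a weak overall result, which is why the article reports the two figures separately.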

In the open-weight arena, Kimi K2.6 has become the new leader with a score of 91.4%, a dramatic jump from Kimi K2.5's 78.3%. DeepSeek also showed impressive gains: DeepSeek V4 Pro improved from 50.2% to 75.7%, while the faster DeepSeek V4 Flash scored 53.2%. Alibaba's Qwen 3.6 Max Preview scored 82.2%, outperforming Qwen 3.6 Plus (71.3%). Tencent's Hy3 Preview managed only 30.2%, and Ling 2.6 1T without reasoning scored just 10.8%. These results highlight the rapid pace of improvement in both proprietary and open-weight models, with reasoning capabilities becoming a key differentiator.

Key Points
  • GPT-5.5 scores 97.5% on high-difficulty puzzles, up from 94.0% in GPT-5.4
  • Kimi K2.6 becomes top open-weight model at 91.4%, up from 78.3%
  • DeepSeek V4 Pro jumps from 50.2% to 75.7%, while Opus 4.7 refuses 54% of puzzles

Why It Matters

GPT-5.5's leap shows reasoning gains are accelerating, with open models like Kimi K2.6 closing the gap.