Developer Tools

OpenAI GPT-5.5 Codex bug: reasoning tokens cluster at 516, degrading complex tasks

New bug report shows GPT-5.5's reasoning tokens spike exactly at 516, linked to wrong answers.

Deep Dive

A detailed bug report filed on OpenAI’s Codex GitHub repository (#30364) presents strong evidence that GPT-5.5’s reasoning token allocation is behaving abnormally. Over a five-month window (Feb–Jun 2026), the reporter analyzed 390,195 response-level token records across 865 sessions. The core finding: GPT-5.5 responses cluster heavily at exactly 516 reasoning output tokens, with secondary spikes at 1034 and 1552 — numbers that look like fixed threshold boundaries rather than natural variation. GPT-5.5 represents only 19.3% of all responses but accounts for 82% of exact-516 events. Its ratio of exact-516 to responses with ≥516 tokens is 44.0%, compared to just 1.3% for other models — a 33.6x difference. The monthly trend is even more alarming: exact-516 clustering jumped from 0.11% in February to 53.3% in May 2026.

The pattern coincides with a sharp drop in overall reasoning-token intensity. Mean reasoning tokens for GPT-5.5 fell from 268.1 in February to 106.9 in May, and P90 tokens dropped from 772 to 344. This suggests that GPT-5.5 may be applying a hidden reasoning budget, capping chains of thought at these fixed values — and that this budgeted reasoning is less effective for complex Codex tasks. The report links to a previous issue (#29353) where a task reproduction showed that responses ending at exactly 516 reasoning tokens returned the wrong answer. The reporter asks OpenAI to investigate whether this is an intended budget cap, a routing degradation, or a scheduling bug, noting that the fixed token counts resemble throttled or fallback behavior. If confirmed, this could explain perceived quality regressions in GPT-5.5 for high-stakes coding scenarios.

Key Points
  • GPT-5.5 accounts for 82% of exact-516 reasoning token events despite only 19.3% of total responses; its 44.0% exact-516 ratio is 33.6x higher than other models.
  • Mean reasoning tokens dropped from 268.1 (Feb) to 106.9 (May), while exact-516 clustering surged from 0.11% to 53.3% in the same period.
  • The fixed token boundaries (516, 1034, 1552) and link to wrong answers (#29353) suggest an internal reasoning budget or truncation affecting complex coding tasks.

Why It Matters

If GPT-5.5 silently truncates reasoning, developers relying on Codex for complex work may get incorrect results without warning.

📬 Get the top 10 AI stories daily