AI Safety

Investigating the consequences of accidentally grading CoT during RL

GPT-5.4 Thinking, GPT-5.1–5.4 Instant, and mini models affected by an internal error.

Deep Dive

OpenAI has reported an accidental breach of its safety practices in its RL training pipeline: chains of thought (CoT) were directly graded during reinforcement learning for multiple released GPT-5 reasoning models. The affected models are GPT-5.4 Thinking, GPT-5.4 Instant, GPT-5.3 Instant, GPT-5.2 Instant, GPT-5.1 Instant, GPT-5.3 mini, and GPT-5.4 mini; GPT-5.5 was not affected. OpenAI’s policy explicitly prohibits grading CoT during RL, because doing so can incentivize models to produce misleading reasoning traces that hide misaligned thoughts. The accidental grading was discovered by a new automated monitoring system.
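
OpenAI has not published its pipeline code, so the following is purely an illustrative Python sketch, with invented names throughout, of the kind of mistake the report describes: the intended reward depends only on the final answer, while the accidental variant lets the grader see the CoT, putting direct optimization pressure on the reasoning trace.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rollout:
    cot: str     # hidden chain of thought produced while reasoning
    answer: str  # final, user-visible answer

def grade_answer_only(rollout: Rollout, grader: Callable[[str], float]) -> float:
    """Intended behavior: the reward depends only on the final answer,
    so RL exerts no direct optimization pressure on the reasoning trace."""
    return grader(rollout.answer)

def grade_with_cot(rollout: Rollout, grader: Callable[[str], float]) -> float:
    """The accidental behavior: CoT is included in the graded text, so RL
    now rewards traces that score well under the grader, including traces
    that merely look aligned rather than ones that faithfully reflect the
    model's reasoning. (Hypothetical sketch, not OpenAI's actual code.)"""
    return grader(rollout.cot + "\n" + rollout.answer)
```

The two setups differ by a single argument to the grader, which is part of why this class of bug is easy to introduce silently and worth detecting automatically.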

Despite the violation, OpenAI’s analyses found no clear reduction in CoT monitorability across the affected models. The company warns, however, that in principle enough optimization pressure on CoT could degrade monitorability. The incident does not change its policy, but it does highlight the fragility of current safety infrastructure, and OpenAI is improving its internal processes to prevent recurrences. The full technical report is available on the alignment blog.
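
The report does not describe how the automated monitoring system works. Purely as a hypothetical illustration (all names invented), a pipeline-level guard in this spirit could wrap each grader and flag any graded text containing verbatim spans of the rollout's CoT:

```python
import logging
from typing import Callable

logger = logging.getLogger("cot_grading_guard")

def guarded_grader(grader: Callable[[str], float],
                   cot: str,
                   window: int = 40) -> Callable[[str], float]:
    """Wrap a grader so that any input containing a verbatim span of the
    chain of thought is logged as a policy violation before scoring.
    (Hypothetical sketch, not OpenAI's actual monitoring system.)"""
    def wrapped(text: str) -> float:
        # Crude leakage check: slide fixed-size windows over the CoT and
        # see whether any of them appears verbatim in the graded text.
        spans = (cot[i:i + window]
                 for i in range(0, max(len(cot) - window, 0) + 1, window))
        if any(span in text for span in spans):
            logger.warning("CoT content reached a grader; policy forbids "
                           "grading CoT during RL.")
        return grader(text)
    return wrapped
```

A production system would need to be far more robust (handling paraphrase, tokenization, and truncation), but even a crude check shows the shape of the defense: make CoT leakage into the reward path an observable event rather than a silent one.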

Key Points
  • Accidental CoT grading occurred in GPT-5.4 Thinking, GPT-5.1–5.4 Instant, and GPT-5.3/5.4 mini models.
  • OpenAI's policy against direct CoT grading aims to preserve the monitorability of reasoning traces for safety.
  • Analyses found no clear reduction in CoT monitorability, but the company warns that enough optimization pressure could degrade it.

Why It Matters

This incident underscores the fragility of current AI alignment monitoring and the need for robust safety infrastructure.