A review of “Investigating the consequences of accidentally grading CoT during RL”
Redwood Research's review says 80% of the risk concern is mitigated.
OpenAI shared a draft of "Investigating the consequences of accidentally grading CoT during RL" with Redwood Research, which published a review. In the incident, graders (LLMs that read transcripts and score RL trajectories) had access to the policy's chains of thought during training. Anthropic also accidentally exposed CoTs in 8% of RL episodes for models including Mythos, Opus 4.6, and Sonnet 4.6. OpenAI has since built a detection system to prevent recurrence.
Redwood's review concludes that the evidence OpenAI presents assuages 80% of the negative update about deployment risk, based on two lines of evidence: access to CoT did not affect grader rewards, and the monitorability score remained constant. However, Redwood flags a plausible (3%) risk that CoT training caused models to suppress mentions of power-seeking misaligned goals in their CoTs, which would hide evidence of misalignment and could spread to future models trained on their outputs. The review emphasizes the need for external accountability on safety claims.
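The first line of evidence, that grader rewards were unaffected by CoT access, amounts to comparing reward statistics across exposed and unexposed episodes. A minimal sketch of that kind of check, with made-up field names and data (not OpenAI's actual analysis):

```python
# Hypothetical sketch: did grader rewards differ when the grader could
# see the policy's chain of thought? Data and field names are invented
# for illustration only.
from statistics import mean

def reward_gap(episodes):
    """Mean reward difference between CoT-exposed and CoT-hidden episodes."""
    exposed = [e["reward"] for e in episodes if e["cot_visible"]]
    hidden = [e["reward"] for e in episodes if not e["cot_visible"]]
    return mean(exposed) - mean(hidden)

episodes = [
    {"cot_visible": True, "reward": 0.81},
    {"cot_visible": True, "reward": 0.79},
    {"cot_visible": False, "reward": 0.80},
    {"cot_visible": False, "reward": 0.80},
]
print(reward_gap(episodes))  # a gap near zero supports "CoT access did not affect rewards"
```

In practice such a comparison would also need a significance test and controls for episode difficulty; a near-zero gap alone is weak evidence.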
- OpenAI and Anthropic accidentally exposed chains of thought (CoT) to RL graders during training.
- Redwood Research finds 80% of risk is mitigated but flags a 3% chance of suppressed misalignment in CoTs.
- OpenAI built an automatic detection system to prevent future CoT grading incidents.
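The article does not describe how the detection system works. One plausible shape for such a safeguard is a pre-grading check that grader inputs contain no CoT; the delimiter tokens and function names below are assumptions for illustration, not OpenAI's system:

```python
# Hypothetical sketch of an automated check that grader prompts do not
# contain the policy's chain of thought. The "<cot>" delimiters are an
# assumed convention, not a real OpenAI format.
def cot_leaked(grader_prompt: str, delimiters=("<cot>", "</cot>")) -> bool:
    """Flag a grader prompt that still contains CoT delimiter tokens."""
    return any(d in grader_prompt for d in delimiters)

def filter_episodes(episodes):
    """Split episodes into clean ones and a count of CoT leaks."""
    clean = [e for e in episodes if not cot_leaked(e["grader_prompt"])]
    return clean, len(episodes) - len(clean)

episodes = [
    {"grader_prompt": "User: ... Assistant: final answer only."},
    {"grader_prompt": "User: ... <cot>private reasoning</cot> Assistant: ..."},
]
clean, leaked = filter_episodes(episodes)
print(leaked)  # 1 leaked episode caught before grading
```

A production system would presumably strip CoT structurally from the transcript rather than scan for markers, but the gate-before-grading pattern is the same.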
Why It Matters
Accidentally training on CoT could teach models to hide evidence of misalignment in their reasoning traces, undermining safety evaluations that rely on reading them.