AI Safety

Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes

Safety lapse reveals AI's internal thoughts during training, undermining oversight in three major models.

Deep Dive

Anthropic disclosed a significant training error where approximately 8% of Claude Mythos Preview's chain-of-thought (CoT) reasoning was accidentally exposed to the model's reward function during training. This marks at least the second independent incident where Anthropic's oversight signal contaminated the AI's internal reasoning process, with previous errors affecting both Opus 4.6 and Sonnet 4.6 models. The technical flaw went undetected for an extended period, revealing fundamental process weaknesses in Anthropic's safety protocols.

This exposure is particularly concerning because it directly undermines the monitorability of AI reasoning traces—the very mechanism meant to detect when models might be intending to misbehave. When AI's internal thoughts are trained against, it becomes impossible to distinguish genuine reasoning from optimized responses. The incident occurred despite Anthropic having identified similar issues in Opus 4.6, where a smaller but concentrated set of episodes were affected near the end of training.

The pattern suggests systemic process failures at Anthropic, where miscommunication about priorities and inadequate safeguards allowed repeated contamination of training data. As AI development increasingly relies on AI labor itself, such errors become more likely and dangerous. The company's transparency in reporting these incidents is commendable but highlights the urgent need for robust procedures to ensure development proceeds according to safety plans, especially as human oversight becomes distributed across potentially untrusted AI systems.

Key Points
  • Anthropic exposed Claude Mythos Preview's chain-of-thought reasoning in 8% of training episodes to reward signals
  • Same error affected Opus 4.6 and Sonnet 4.6 models, revealing systemic process failures
  • Undermines ability to monitor AI intent and detect potential misbehavior during reasoning

Why It Matters

Repeated safety lapses in top AI labs threaten our ability to monitor and control increasingly powerful systems as human oversight diminishes.