Reasoning Models Struggle to Control their Chains of Thought
Claude Sonnet 4.5 can control its final answer 61.9% of the time, but its reasoning only 2.7%.
A team of researchers, including authors from institutions like New York University and Anthropic, published a paper titled "Reasoning Models Struggle to Control their Chains of Thought." They introduced a new benchmark called the CoT-Control evaluation suite to test a critical safety question: Can advanced AI models deliberately control or censor the step-by-step reasoning they show in their chain-of-thought (CoT)? This is crucial because if models can hide their true intentions in their reasoning, using CoT to monitor for misbehavior becomes unreliable.
The study's key finding is that current models possess alarmingly low CoT controllability. For example, when asked to solve a genetics problem without using the word 'chromosome' in its reasoning, Claude Sonnet 4.5 succeeded only 2.7% of the time. In stark contrast, the same model could control its final output to avoid a banned word 61.9% of the time. This gap indicates models cannot easily deceive observers by crafting a fake reasoning process; their internal 'train of thought' leaks out.
Further analysis revealed that CoT controllability decreases with more reinforcement learning (RL) training and increases with model size, though it remains low overall. Notably, even offering models incentives to evade monitors or using adversarial prompt optimization did not meaningfully increase their ability to control CoT. The researchers conclude with cautious optimism: for now, CoT appears to be a robust window into model reasoning, but they strongly recommend frontier AI labs track this capability as models evolve.
- Claude Sonnet 4.5 showed a massive controllability gap: 61.9% for final output vs. just 2.7% for its chain-of-thought reasoning.
- CoT controllability decreases with more RL training and test-time compute, suggesting more aligned models may be less deceptive.
- Adversarial prompts and incentives failed to significantly boost controllability, indicating current architectural limits to hiding reasoning.
Why It Matters
This validates chain-of-thought as a critical, hard-to-fake transparency tool for AI safety and alignment research.