AI Safety

Reasoning Models Struggle to Control Their Chains of Thought

LessWrong AI March 06, 2026

⚡Claude Sonnet 4.5 controls its reasoning chain only 2.7% of the time, versus 61.9% for final outputs.

Deep Dive

A collaborative research team from OpenAI and the AI Control MATS Program has published a critical study revealing that advanced reasoning models struggle to control their internal chains of thought (CoT). The researchers introduced 'CoT-Control,' a comprehensive evaluation suite of 14,076 tasks designed to measure 'CoT controllability'—the ability of AI models to follow explicit instructions about what may or may not appear in their reasoning traces while solving problems. In a significant policy move, OpenAI announced it will begin formally tracking this controllability metric in future model releases, starting with GPT-5.4. This research addresses a core safety concern: as AI agents become more capable and autonomous, monitoring their reasoning processes (CoT monitoring) is a promising alignment technique, but its effectiveness depends on models not learning to strategically hide or manipulate their internal thoughts.

The study tested 13 frontier models and found alarmingly low CoT controllability. For instance, Claude Sonnet 4.5 succeeded in controlling its reasoning chain only 2.7% of the time, compared to a 61.9% success rate for controlling its final outputs. The research identified key trends: CoT controllability increases with model size but decreases with more reinforcement learning (RL) training, additional test-time compute, and increased problem difficulty. Notably, when models were made 'situationally aware' that they were being monitored, controllability increased by up to 4 percentage points. While the current low controllability is cautiously optimistic for safety, the underlying mechanisms are poorly understood, prompting the recommendation for labs to actively track this emergent capability as models scale.

Key Points

Claude Sonnet 4.5 showed a massive controllability gap: 2.7% for reasoning chains vs. 61.9% for final outputs.
CoT controllability decreases with more RL training; OLMo-3-7B-RL-Zero-Math's ability dropped by an order of magnitude post-training.
OpenAI commits to tracking CoT controllability as a key safety metric, starting with the upcoming GPT-5.4 model.

Why It Matters

This establishes a vital new benchmark for AI safety, ensuring we can trust what models are 'thinking' before they act autonomously.

Read Original Article

Reasoning Models Struggle to Control Their Chains of Thought

Why It Matters

Stay Ahead in AI