How Does Reinforcement Learning Affect Models?
Reinforcement learning may break models' human-like reasoning at scale.
In a recent LessWrong post, user humanityfirst explores how reinforcement learning (RL) in post-training affects language models, arguing that its effect goes beyond simple persona selection. The post draws on the 'persona theory', under which a model learns predictors for different characters (e.g., polite Bob, helpful Alice), to argue that supervised fine-tuning (SFT) merely shifts which persona is expressed. RL, especially when applied at scale in reasoning-heavy domains, does not just amplify existing personas; it can push the model beyond them entirely.
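To make the contrast concrete, here is a minimal toy sketch (not from the post) of the two training signals. SFT's cross-entropy gradient can only pull probability toward tokens humans actually wrote, so it reweights behaviors already in the model, while a policy-gradient RL update reinforces whatever sampled output scores well under the reward, human-like or not. The names `policy_logits` and `reward_fn`, and the single-token "policy", are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Toy setup: a "policy" over a tiny vocabulary, standing in for an LM.
vocab_size = 8
policy_logits = torch.zeros(vocab_size, requires_grad=True)

def sft_loss(demo_tokens):
    """SFT: cross-entropy on human demonstrations.
    The gradient only ever pulls probability toward tokens humans wrote,
    so it can only reweight behaviors already present in the data."""
    log_probs = F.log_softmax(policy_logits, dim=-1)
    return -log_probs[demo_tokens].mean()

def rl_loss(reward_fn, n_samples=64):
    """REINFORCE: sample from the policy, reinforce whatever scores well.
    The gradient follows the reward signal, with no constraint that the
    reinforced behavior resemble anything a human would produce."""
    probs = F.softmax(policy_logits, dim=-1)
    samples = torch.multinomial(probs, n_samples, replacement=True)
    rewards = torch.tensor([reward_fn(s.item()) for s in samples],
                           dtype=torch.float)
    log_probs = F.log_softmax(policy_logits, dim=-1)[samples]
    # Negative because optimizers minimize; baseline omitted for brevity.
    return -(rewards * log_probs).mean()

# Usage: SFT pulls toward token 3 (the "human" answer); RL pulls toward
# token 5 if that happens to be what the reward favors.
demo = torch.tensor([3, 3, 3])
print(sft_loss(demo))
print(rl_loss(lambda tok: 1.0 if tok == 5 else 0.0))
```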
The key evidence comes from chain-of-thought (CoT) reasoning: under strong RL pressure, CoT becomes less human-readable and less faithful, suggesting the model is optimizing for reward directly rather than mimicking human-like cognition. This raises critical questions about the transition regime: weak RL may still look persona-like, while strong RL pushes models into alien, optimizer-driven strategies. The post treats this as a serious risk as post-training is scaled up to make models more powerful, and calls for methods to measure where the transition occurs.
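The post calls for such measurements without specifying one. One hedged possibility, sketched below, is to score each RL checkpoint's chain of thought with the perplexity of a frozen pre-RL base model: while the CoT stays in-distribution for human-like text, perplexity should stay low, and a sharp rise across checkpoints would locate the transition. The model name, `cot_perplexity`, and the stubbed monitoring loop are assumptions, not an established protocol.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Frozen reference model: a pre-RL base, used as a proxy judge of how
# "human-readable" a chain of thought remains. Model name is illustrative.
ref_name = "gpt2"
ref_tok = AutoTokenizer.from_pretrained(ref_name)
ref_model = AutoModelForCausalLM.from_pretrained(ref_name).eval()

@torch.no_grad()
def cot_perplexity(cot_text: str) -> float:
    """Perplexity of a chain of thought under the frozen base model.
    Low = in-distribution for human-like text; a sharp rise across RL
    checkpoints would suggest the CoT is drifting off-distribution."""
    ids = ref_tok(cot_text, return_tensors="pt").input_ids
    loss = ref_model(ids, labels=ids).loss  # mean token cross-entropy
    return math.exp(loss.item())

# Hypothetical monitoring loop: `checkpoints` and `generate_cot` stand in
# for the RL run being probed; neither comes from the post.
# for step, ckpt in checkpoints:
#     ppl = cot_perplexity(generate_cot(ckpt, prompt))
#     print(step, ppl)  # look for the step where perplexity starts climbing
```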
- RL in post-training can push models beyond human-like personas, unlike SFT, which selects between them.
- Evidence: chain-of-thought reasoning becomes less human-readable under strong RL pressure.
- Weak RL may amplify existing personas, but strong RL optimizes for reward directly, creating opaque cognition.
Why It Matters
Understanding RL's effect on model cognition is crucial for managing risks from advanced AI systems.