Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models
RL preserves the base model's features, while SFT introduces specialized ones that drive forgetting of general capabilities.
A new paper accepted at ACL 2026 by Dan Shi, Zhuowen Han, Simon Ostermann, and colleagues provides a mechanistic explanation for why reinforcement learning (RL)-based post-training generalizes better than supervised fine-tuning (SFT) in large language models. The researchers developed a feature-level analysis framework to probe internal activations across models trained from the same base model on identical data. Their key finding: SFT rapidly introduces many highly specialized features that stabilize early in training, driving forgetting of general capabilities, whereas RL induces more restrained, continually evolving feature changes that largely preserve the base model's representations.
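To make the probing setup concrete, here is a minimal sketch of how feature overlap between a base checkpoint and a post-trained one could be measured, assuming a sparse-autoencoder-style feature dictionary applied to hidden-layer activations. The model names, layer index, encoder matrix, and overlap metric are illustrative placeholders, not the paper's exact pipeline.

```python
# Sketch: compare which features fire in a base model vs. a post-trained model.
# Assumptions (not from the paper): features come from an SAE-style dictionary,
# model names and the layer index are hypothetical, d_model is taken to be 4096.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "base-model"          # hypothetical checkpoint names
TUNED_ID = "post-trained-model"
LAYER = 12                      # which hidden layer to probe

def hidden_states(model_id: str, texts: list[str]) -> torch.Tensor:
    """Return layer-LAYER activations for each text, mean-pooled over tokens."""
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True)
    model.eval()
    pooled = []
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt")
            hs = model(**ids).hidden_states[LAYER]   # (1, seq, d_model)
            pooled.append(hs.mean(dim=1).squeeze(0))  # (d_model,)
    return torch.stack(pooled)                        # (n_texts, d_model)

def active_features(acts: torch.Tensor, encoder: torch.Tensor, thr: float = 0.1) -> set[int]:
    """Indices of dictionary features firing above a threshold on any probe text."""
    feats = torch.relu(acts @ encoder.T)              # (n_texts, n_features)
    return set(torch.nonzero(feats.max(dim=0).values > thr).flatten().tolist())

texts = ["Solve 17 * 24.", "Summarize the causes of World War I."]
encoder = torch.randn(4096, 4096)   # stand-in for a trained SAE encoder

base_set = active_features(hidden_states(BASE_ID, texts), encoder)
tuned_set = active_features(hidden_states(TUNED_ID, texts), encoder)

# Jaccard overlap with the base model: the paper's claim predicts this stays
# high for RL checkpoints and drops for SFT checkpoints.
overlap = len(base_set & tuned_set) / max(len(base_set | tuned_set), 1)
print(f"feature overlap with base model: {overlap:.2f}")
```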
The study further identifies a compact, task-agnostic set of features that directly mediates generalization across diverse tasks. Through feature-level interventions, the team confirmed these features' causal role: disabling them significantly degrades RL models' generalization performance, while amplifying them improves base models' performance. This provides the first mechanistic evidence for RL's superior out-of-domain generalization, with implications for designing more efficient post-training strategies. The code is publicly available on GitHub.
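The intervention side can be sketched in the same spirit: rescaling the hidden state's component along candidate feature directions with a forward hook, where a scale of zero ablates those features and a scale above one amplifies them. The checkpoint name, layer, and random directions below are hypothetical stand-ins, not the feature set the authors identified, and the `model.model.layers` path assumes a Llama-style architecture.

```python
# Sketch: feature-level intervention via a forward hook, in the spirit of the
# paper's ablation/amplification experiments. All specifics are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "rl-post-trained-model"   # hypothetical checkpoint name
LAYER = 12
SCALE = 0.0                          # 0.0 disables the features; >1.0 amplifies them

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

# Unit-norm directions standing in for the identified generalization features.
feature_dirs = torch.nn.functional.normalize(
    torch.randn(8, model.config.hidden_size), dim=-1
)

def intervene(module, inputs, output):
    """Rescale the hidden state's component along each feature direction."""
    hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
    coeffs = hidden @ feature_dirs.T                             # (batch, seq, n_features)
    patched = hidden + (SCALE - 1.0) * (coeffs @ feature_dirs)   # adjust along directions
    return (patched, *output[1:]) if isinstance(output, tuple) else patched

handle = model.model.layers[LAYER].register_forward_hook(intervene)
ids = tok("A farmer has 17 sheep and buys 5 more. How many sheep does he have?",
          return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=32)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```

Comparing task accuracy with `SCALE = 0.0` against the unpatched model would mirror the disabling experiment; setting `SCALE > 1.0` on the base model mirrors the amplification one.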
- SFT introduces specialized features that cause forgetting of general capabilities, while RL preserves the base model's representations
- A compact, task-agnostic feature set mediates RL's generalization across diverse tasks
- Feature-level interventions confirm causal roles: disabling these features degrades RL models' performance, amplifying them boosts base models' performance
Why It Matters
Explains why RL beats SFT for generalization, enabling better post-training strategies for LLMs.