New study reveals why RL beats SFT for preserving LLM capabilities during fine-tuning
Qwen2.5-3B model shows RL preserves internal circuits 2x better than SFT on scientific QA.
A new arXiv paper from Jeanmely Rojas Nunez and colleagues investigates the mechanistic roots of catastrophic forgetting in LLM fine-tuning. Prior work noted that RL retains capabilities better than SFT; this study asks why, by tracking how internal computational circuits change during training. The authors propose 'differential circuit vulnerability,' a head-level metric that quantifies circuit degradation during fine-tuning. Using Qwen2.5-3B-Instruct (Alibaba's 3B-parameter model) adapted for scientific question answering, they compared RL (policy-gradient updates) against standard SFT.
The findings reveal a clear trade-off: SFT adapts rapidly to the target task but causes significant circuit disruption, which leads to forgetting of previously learned capabilities. In contrast, RL preserves a much larger fraction of the base model's circuits and so forgets less, though at the expense of slower task adaptation. The authors argue that circuit preservation is a key mechanistic reason for RL's robustness. Their code is open-sourced, offering practitioners a way to measure and potentially mitigate forgetting in their own fine-tuning pipelines.
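To make the idea of head-level circuit degradation concrete, here is a minimal Python sketch. It is not the paper's definition of differential circuit vulnerability, which this summary does not spell out. It assumes each attention head already has a non-negative importance score, measured on a probe task before and after fine-tuning, and reports the fraction of base-circuit heads that keep most of that score. The function names, the `max_drop` threshold, and the toy scores are all hypothetical.

```python
from typing import Dict, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def head_degradation(base: Dict[Head, float], tuned: Dict[Head, float]) -> Dict[Head, float]:
    """Relative drop in each head's importance score, clipped to [0, 1].

    Hypothetical stand-in metric, NOT the paper's definition.
    """
    drops = {}
    for head, base_score in base.items():
        if base_score <= 0:
            continue  # head was not part of the base circuit; skip it
        tuned_score = tuned.get(head, 0.0)
        relative_drop = (base_score - tuned_score) / base_score
        drops[head] = min(1.0, max(0.0, relative_drop))
    return drops


def preserved_fraction(base: Dict[Head, float], tuned: Dict[Head, float],
                       max_drop: float = 0.5) -> float:
    """Fraction of base-circuit heads whose relative drop is at most max_drop."""
    drops = head_degradation(base, tuned)
    if not drops:
        return 0.0
    kept = sum(1 for d in drops.values() if d <= max_drop)
    return kept / len(drops)


# Made-up toy scores, purely illustrative (not results from the paper).
base_scores = {(0, 0): 0.8, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.2}
checkpoint_a = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.2}
checkpoint_b = {(0, 0): 0.7, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.05}

print(preserved_fraction(base_scores, checkpoint_a))
print(preserved_fraction(base_scores, checkpoint_b))
```

With these toy inputs, `checkpoint_b` keeps more of the base circuit than `checkpoint_a`. In a real pipeline the scores would come from an attribution or ablation method of your choice, and the threshold would need tuning.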
- Introduced 'differential circuit vulnerability' to measure head-level circuit degradation during fine-tuning.
- On Qwen2.5-3B-Instruct, SFT caused substantially greater circuit disruption and forgetting than RL.
- RL preserved more base circuits but adapted more slowly to the target scientific QA task.
Why It Matters
The results give model developers a mechanistic reason to favor RL over SFT when retaining a model's prior capabilities matters more than fast adaptation to the new task.