LessWrong Study: Off-Model SFT Hurts AI Reasoning - But Fix Is Quick
Training a model on another AI's outputs can cut reasoning performance by up to 30% on hard problems.
Researchers from LessWrong (SebastianP et al.) found that off-model supervised fine-tuning (SFT) degrades AI reasoning performance, especially on problems that require long chains of reasoning such as MATH-500 and Olympiads, even when the student learns from a stronger teacher model like Claude Opus 4.7 or GPT-5.5. The cause appears to be an unfamiliar reasoning style, not teacher quality or perplexity. Crucially, a small amount of training on unrelated data restores performance, and the effect is context-specific: degradation appears only in certain prompt contexts.
- Off-model SFT degrades performance on reasoning tasks (MATH-500, Olympiads) by up to 30% but leaves simple tasks (MMLU) unaffected.
- Degradation occurs even when the teacher (e.g., Claude Opus 4.7) is stronger than the student, ruling out the "dumb teacher" hypothesis.
- Performance can be recovered by training on unrelated data, indicating the new reasoning style is shallow and reversible.
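The summary above does not say whether "up to 30%" is a relative drop or a drop in percentage points, and the two readings can differ a lot. The following is a minimal Python sketch of how each would be computed. The function name `accuracy_drop` and the example accuracies are hypothetical placeholders chosen for illustration, not figures from the study.

```python
def accuracy_drop(before: float, after: float) -> dict:
    """Return the drop in accuracy both in percentage points and in relative terms.

    `before` and `after` are accuracies expressed as fractions in [0, 1].
    """
    if not (0.0 < before <= 1.0) or not (0.0 <= after <= 1.0):
        raise ValueError("accuracies must be fractions in [0, 1], with before > 0")
    absolute = before - after          # difference in raw accuracy
    relative = absolute / before       # difference as a share of the baseline
    return {
        "absolute_points": absolute * 100,
        "relative_percent": relative * 100,
    }


# Placeholder numbers for illustration only (NOT results from the study):
# a hypothetical student at 80% accuracy before off-model SFT and 56% after.
example = accuracy_drop(before=0.80, after=0.56)
print(example)
```

With these placeholder inputs the same change reads as roughly a 24-point drop or roughly a 30% relative drop, which is why it helps to know which convention a headline figure uses.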
Why It Matters
Off-model SFT is vital for AI alignment; this research shows how to prevent capability loss in controlled models.