AI Safety

LessWrong Study: Off-Model SFT Hurts AI Reasoning - But Fix Is Quick

Training a model on another AI's outputs can cut its reasoning performance by up to 30% on hard problems.

Deep Dive

Researchers posting on LessWrong (SebastianP et al.) found that off-model supervised fine-tuning (SFT), in which a model is fine-tuned on outputs generated by a different model, degrades reasoning performance. The drop is largest on problems that require long chains of reasoning, such as MATH-500 and Olympiads, and it appears even when the student learns from a stronger teacher model like Claude Opus 4.7 or GPT-5.5. The cause appears to be the unfamiliar reasoning style of the teacher's outputs, not teacher quality or perplexity. Crucially, a small amount of training on unrelated data restores performance. The effect is also context-specific: degradation appears only in certain prompt contexts, not in others.

Key Points
  • Off-model SFT degrades performance on reasoning tasks (MATH-500, Olympiads) by up to 30% but leaves simpler tasks (MMLU) unaffected.
  • Degradation occurs even when the teacher (e.g., Claude Opus 4.7) is stronger than the student, which rules out the "dumb teacher" hypothesis.
  • Performance can be recovered with a small amount of training on unrelated data, indicating that the new reasoning style is shallow and reversible.

Why It Matters

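Off-model SFT is an important technique in AI alignment work. This research suggests that the capability loss it causes in controlled models is shallow and can be recovered with a small amount of additional training.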