New Study Reveals When On-Policy Distillation Fails for LLMs
Training models on their own outputs can backfire—here's how to fix it.
A new arXiv paper by Siqi Zhu and colleagues provides the first systematic dissection of on-policy distillation (OPD) and on-policy self-distillation (OPSD) for large language models. Both techniques train a model on trajectories it generates itself, with dense token-level supervision along those trajectories. Results so far have been mixed: sometimes the methods improve system prompt internalization, other times they degrade performance. The study pinpoints when and why they break down.
The authors identify three failure mechanisms: (1) a distribution mismatch between teacher and student when the student conditions on its own generated prefixes; (2) optimization instability caused by biased gradients from TopK reverse-KL objectives; and (3) an OPSD-specific flaw in which the student, which has no access to privileged information (PI), learns a policy that averages across PI-conditioned teachers. That averaging fails when PI is instance-specific (e.g., per-example reasoning). The authors demonstrate three fixes: applying stop-gradient to TopK objectives, using RLVR-adapted teachers for reward-aware distillation, and stabilizing the student with SFT before applying self-distillation. For practitioners, this means OPD and OPSD can work well for system prompts or alignment preferences, but they require careful validation on instance-sensitive tasks such as mathematical reasoning.
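To make the TopK reverse-KL objective concrete, here is a minimal sketch in plain Python of the loss at a single token position, computed once over the full vocabulary and once over only the student's top-K tokens. The renormalization over the top-K support, the function names, and the toy logits are assumptions for illustration; the paper's exact TopK objective and its stop-gradient fix are not reproduced here.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a small toy vocabulary.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student, teacher):
    # KL(student || teacher): the expectation is taken under the student.
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

def topk_reverse_kl(student, teacher, k):
    # Assumed variant: keep the student's k most likely tokens and
    # renormalize both distributions on that support before taking the KL.
    top = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    s_mass = sum(student[i] for i in top)
    t_mass = sum(teacher[i] for i in top)
    s_top = [student[i] / s_mass for i in top]
    t_top = [teacher[i] / t_mass for i in top]
    return reverse_kl(s_top, t_top)

# Hypothetical logits for one position on a student-generated prefix.
student = softmax([2.0, 1.0, 0.5, -1.0, -2.0])
teacher = softmax([0.5, 2.0, 1.0, 0.0, -1.0])

full = reverse_kl(student, teacher)
truncated = topk_reverse_kl(student, teacher, k=2)
print(f"full reverse KL: {full:.4f}, top-2 reverse KL: {truncated:.4f}")
```

The two values differ because truncation changes both the support and the normalization. In an autograd framework, the renormalization mass `s_mass` would itself depend on the student's parameters. This sketch shows only the forward computation and makes no claim about where the paper places its stop-gradient.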
- OPD for math reasoning is highly sensitive to teacher choice and loss formulation; naive application often degrades performance.
- OPSD fails when privileged information (PI) is instance-specific (e.g., per-example reasoning steps) but works when PI is a shared latent rule (e.g., system prompt or alignment preference).
- Fixes include stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students to avoid distribution mismatch and optimization instability.
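The contrast between shared and instance-specific PI can be illustrated with a toy calculation. For a fixed set of teacher distributions, the single PI-free distribution that minimizes the average reverse KL to all of them is their normalized geometric mean, which is a standard result. The sketch below assumes a reverse-KL objective, a three-token vocabulary, and made-up teacher probabilities; it illustrates the averaging effect and is not the paper's formal argument.

```python
import math

def kl(q, p):
    # KL(q || p) for two distributions over the same small vocabulary.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def best_pi_free_student(teachers):
    # Minimizer of mean KL(q || p_i) over q: the normalized geometric mean.
    n = len(teachers)
    vocab = len(teachers[0])
    unnorm = [
        math.exp(sum(math.log(p[v]) for p in teachers) / n)
        for v in range(vocab)
    ]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def mean_reverse_kl(q, teachers):
    return sum(kl(q, p) for p in teachers) / len(teachers)

# Hypothetical teachers: with a shared rule, every PI value implies the
# same next-token distribution; with instance-specific PI, they disagree.
shared_rule = [[0.90, 0.05, 0.05], [0.90, 0.05, 0.05]]
instance_specific = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05]]

q_shared = best_pi_free_student(shared_rule)
q_specific = best_pi_free_student(instance_specific)
gap_shared = mean_reverse_kl(q_shared, shared_rule)
gap_specific = mean_reverse_kl(q_specific, instance_specific)
print("shared rule gap:", gap_shared)
print("instance-specific gap:", gap_specific)
```

With the shared rule, the best PI-free student matches the teachers and the gap is essentially zero. With conflicting teachers, even the best PI-free student splits its mass between the two preferred tokens and a positive gap remains, so no amount of training closes it.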
Why It Matters
Knowing when OPD and OPSD break down lets LLM developers avoid these failure modes up front, which can save training time and make model improvements more reliable.