MOPD: New Distillation Method Learns from Both Successes and Failures
Multiple rollouts per prompt? MOPD uses peer successes and failures to train better LLMs.
Large language models (LLMs) are often post-trained with sparse verifier rewards that indicate only whether a whole trajectory succeeded or failed, offering little guidance on where the reasoning broke down. On-policy distillation (OPD) densifies this supervision by having a teacher score student-generated trajectories, but existing OPD methods treat each rollout independently and ignore the other attempts sampled for the same prompt. A new paper from researchers at multiple institutions introduces Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned framework that uses the student's local rollout group to build richer teacher signals. MOPD conditions the teacher on both successful and failed rollouts: successes provide positive evidence of valid reasoning patterns, while failures serve as structured negative examples of plausible mistakes to avoid. The study explores two ways of constructing the peer context: positive peer imitation, which shows the teacher successful peers, and contrastive success-failure conditioning, which shows it both successes and failures.
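To make the idea of peer conditioning concrete, here is a minimal sketch of how a teacher context might be assembled from a rollout group. It is an illustration, not the paper's implementation: the `Rollout` class, the `build_peer_context` function, the prompt wording, the binary reward threshold of 0.5, and the `max_peers` cap are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One student attempt at a prompt (hypothetical container)."""
    text: str
    reward: float  # assumed binary verifier reward: 1.0 success, 0.0 failure


def build_peer_context(prompt: str, group: List[Rollout], target_idx: int,
                       mode: str = "contrastive", max_peers: int = 2) -> str:
    """Assemble the text a teacher would see before scoring one rollout.

    mode="positive"    -> only successful peers are shown
    mode="contrastive" -> successful and failed peers are shown
    The rollout being scored (target_idx) is never used as its own peer.
    """
    if mode not in ("positive", "contrastive"):
        raise ValueError(f"unknown mode: {mode}")

    peers = [r for i, r in enumerate(group) if i != target_idx]
    successes = [r for r in peers if r.reward > 0.5][:max_peers]
    failures = [r for r in peers if r.reward <= 0.5][:max_peers]

    parts = [f"Problem:\n{prompt}"]
    for r in successes:
        parts.append(f"Correct peer attempt:\n{r.text}")
    if mode == "contrastive":
        for r in failures:
            parts.append(f"Incorrect peer attempt (avoid these mistakes):\n{r.text}")
    return "\n\n".join(parts)


# Example: three rollouts for one prompt; the first is the one being scored.
group = [
    Rollout("attempt-alpha", 0.0),
    Rollout("attempt-beta", 1.0),
    Rollout("attempt-gamma", 0.0),
]
context = build_peer_context("What is 2+2?", group, target_idx=0)
```

In this sketch the teacher would then score the target rollout's tokens given `context`, so the same rollout can receive different supervision depending on what its peers got right or wrong.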
The team tested MOPD on benchmarks spanning four domains: competitive programming, mathematical reasoning, scientific question answering, and tool use. Across all four, MOPD consistently outperformed standard on-policy baselines. Further analysis of the teacher signals showed that mixing success and failure contexts yields teacher scores that align more closely with the actual verifier rewards, meaning the supervision becomes more faithful and instance-adaptive. The authors conclude that effective on-policy distillation should exploit the student's multi-rollout trial-and-error behavior rather than treat rollouts as isolated samples. The practical implication is that more capable LLMs can be trained efficiently by using peer information that standard methods ignore.
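One simple way to picture "alignment" between teacher scores and verifier rewards is a correlation across the rollouts of a prompt. The sketch below is an assumption for illustration only: the article does not say which alignment measure the authors used, and the `pearson` helper and the sample numbers are made up for this example.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences.

    Returns 0.0 when either sequence is constant, since the correlation
    is undefined in that case.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-rollout teacher scores (e.g. mean log-probability the
# teacher assigns to the student's tokens) and binary verifier rewards.
teacher_scores = [-0.2, -1.5, -0.3, -1.8]
verifier_rewards = [1.0, 0.0, 1.0, 0.0]
alignment = pearson(teacher_scores, verifier_rewards)
```

Under this reading, a higher value means the teacher tends to score verified-correct rollouts above incorrect ones, which is the sense in which the supervision would be called faithful.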
- MOPD conditions teacher models on both successful and failed student rollouts for the same prompt, providing positive and negative evidence.
- MOPD was evaluated on competitive programming, math reasoning, scientific QA, and tool-use benchmarks, where it consistently beat standard on-policy baselines.
- Mixed success-failure contexts better align teacher scores with verifier rewards, producing more faithful and instance-adaptive supervision.
Why It Matters
MOPD makes LLM post-training more data-efficient by drawing on the peer successes and failures already sampled for each prompt, which standard methods ignore, to improve reasoning capabilities.