Research & Papers

MOPD: New Distillation Method Learns from Both Successes and Failures

Multiple rollouts per prompt? MOPD uses peer successes and failures to train better LLMs.

Deep Dive

Large language models (LLMs) are often post-trained with sparse verifier rewards that indicate only whether a whole trajectory succeeded or failed, offering limited guidance on where the reasoning breaks down. On-policy distillation (OPD) densifies this supervision by having a teacher model supervise student-generated trajectories, but existing OPD methods treat each rollout independently and ignore the other attempts sampled for the same prompt. A new paper from researchers at multiple institutions introduces Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned framework that uses the student's local rollout group to build richer teacher signals. MOPD conditions the teacher on both successful and failed rollouts: successes provide positive evidence of valid reasoning patterns, while failures serve as structured negative examples of plausible mistakes to avoid. The study explores two peer-context constructions, positive peer imitation and contrastive success-failure conditioning.
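To make the two peer-context constructions concrete, here is a minimal sketch of how a teacher prompt might be assembled from a rollout group. It is an illustration under stated assumptions, not the paper's implementation: the `Rollout` type, the `build_peer_context` function, the binary reward threshold, the `max_peers` cap, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    text: str      # student-generated trajectory
    reward: float  # verifier reward; assumed binary (1.0 success, 0.0 failure)


def build_peer_context(prompt: str, group: List[Rollout], target_idx: int,
                       mode: str = "contrastive", max_peers: int = 2) -> str:
    """Assemble a teacher context for one target rollout from its peers.

    mode="positive":    include only successful peer rollouts.
    mode="contrastive": include successful and failed peer rollouts.
    Both modes and the template below are assumptions for illustration.
    """
    if mode not in ("positive", "contrastive"):
        raise ValueError(f"unknown mode: {mode}")

    # Peers are the other rollouts sampled for the same prompt.
    peers = [r for i, r in enumerate(group) if i != target_idx]
    successes = [r for r in peers if r.reward > 0.5][:max_peers]
    failures = [r for r in peers if r.reward <= 0.5][:max_peers]

    parts = [f"Problem:\n{prompt}"]
    for k, r in enumerate(successes, 1):
        parts.append(f"Successful peer attempt {k}:\n{r.text}")
    if mode == "contrastive":
        for k, r in enumerate(failures, 1):
            parts.append(f"Failed peer attempt {k} (mistakes to avoid):\n{r.text}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo_group = [
        Rollout("x = 4, so the answer is 8", 1.0),
        Rollout("x = 3, so the answer is 6", 0.0),
        Rollout("solving gives x = 4, answer 8", 1.0),
    ]
    # Context the teacher would see when scoring rollout 1 (a failed attempt).
    print(build_peer_context("Find 2x given x + 1 = 5.", demo_group, target_idx=1))
```

In this sketch the target rollout is excluded from its own context, so the teacher scores it against evidence from its peers only. Whether the paper does the same is not stated in the summary above.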

The team tested MOPD in four challenging domains: competitive programming, mathematical reasoning, scientific question answering, and tool use. Across all four, MOPD consistently outperformed standard on-policy baselines. Further analysis of the teacher signals showed that mixing success and failure contexts yields teacher scores that align better with the actual verifier rewards, meaning the supervision becomes more faithful and instance-adaptive. The authors conclude that effective on-policy distillation should exploit the student's multi-rollout trial-and-error behavior rather than treat rollouts as isolated samples. The work has practical implications for training more capable LLMs efficiently, because it draws on peer rollout information that single-rollout methods leave unused.
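One simple way to quantify how well teacher scores align with verifier rewards is pairwise agreement: the fraction of (success, failure) pairs in which the successful rollout receives the higher teacher score. The sketch below is an assumption for illustration. The summary does not say which alignment metric the authors use, and `pairwise_agreement` and the example scores are hypothetical.

```python
from typing import Optional, Sequence


def pairwise_agreement(teacher_scores: Sequence[float],
                       rewards: Sequence[float]) -> Optional[float]:
    """Fraction of (success, failure) pairs ranked correctly by the teacher.

    teacher_scores: one scalar per rollout, e.g. a mean teacher log-probability
                    (how the scalar is derived is an assumption here).
    rewards:        verifier rewards, assumed binary (success if > 0.5).
    Returns None when the group has no success or no failure to compare.
    """
    if len(teacher_scores) != len(rewards):
        raise ValueError("teacher_scores and rewards must have the same length")

    pos = [s for s, r in zip(teacher_scores, rewards) if r > 0.5]
    neg = [s for s, r in zip(teacher_scores, rewards) if r <= 0.5]
    if not pos or not neg:
        return None

    # A correctly ordered pair counts 1, a tie counts 0.5, otherwise 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    scores = [-0.2, -1.5, -0.4, -0.9]   # hypothetical teacher scores
    rewards = [1.0, 0.0, 1.0, 0.0]      # verifier outcomes for the same rollouts
    print(pairwise_agreement(scores, rewards))
```

A value near 1 means the teacher ranks verified successes above failures, while a value near 0.5 means its scores carry little information about the verifier outcome.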

Key Points
  • MOPD conditions teacher models on both successful and failed student rollouts for the same prompt, providing positive and negative evidence.
  • Tested on competitive programming, math reasoning, scientific QA, and tool-use benchmarks, consistently beating standard on-policy distillation.
  • Mixed success-failure contexts better align teacher scores with verifier rewards, producing more faithful and instance-adaptive supervision.

Why It Matters

MOPD aims to make LLM post-training more data-efficient by drawing on the trial-and-error rollouts that standard on-policy distillation treats in isolation, with reported gains in reasoning performance across the tested domains.