Research & Papers

Distribution-Corrected Distillation Boosts Small LLMs on Math Tasks

New method fixes teacher-student drift without costly online sampling.

Deep Dive

A team from multiple institutions introduces Distribution Corrected Offline Data Distillation (DCOD) to address a fundamental problem in LLM reasoning distillation: when a small student model learns from static teacher-generated traces, it accumulates errors during autoregressive inference because its own prefixes differ from the teacher's. Existing offline methods suffer from this distributional drift, while on-policy distillation (e.g., self-distillation) avoids drift but requires expensive online sampling and produces low-quality early traces.

DCOD stays within the efficient offline paradigm but actively corrects the drift by reweighting teacher supervision toward examples that are closer to the student's current on-policy distribution. The authors evaluate on mathematical reasoning benchmarks including GSM8K, MATH, and held-out competition-level sets (AMC, AIME, OlympiadBench). The reported results show significant accuracy improvements over standard offline distillation baselines, with more stable trace quality and no degradation in instruction-following. The approach suggests that lightweight, distribution-aware training can substantially strengthen offline reasoning distillation without requiring online rollouts.
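The summary does not give DCOD's actual weighting formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes "closeness to the student's distribution" can be approximated by the mean per-token log-probability the student assigns to each teacher trace, and turns those scores into softmax weights on the per-trace loss. The function names, the `temperature` parameter, and the toy numbers are all hypothetical.

```python
import math

def closeness_weights(student_logprobs, temperature=1.0):
    """Assumed scheme: softmax over the student's mean per-token log-prob
    of each teacher trace. Higher student likelihood -> larger weight."""
    scores = [sum(lp) / len(lp) / temperature for lp in student_logprobs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_nll(student_logprobs, weights):
    """Weighted sum of per-trace mean negative log-likelihoods."""
    per_trace_nll = [-sum(lp) / len(lp) for lp in student_logprobs]
    return sum(w * nll for w, nll in zip(weights, per_trace_nll))

# Toy batch (made-up values): student log-probs on two teacher traces.
# The first trace is close to what the student would generate itself;
# the second is far from the student's current distribution.
batch = [[-0.1, -0.2, -0.1], [-2.0, -3.0, -2.5]]

weights = closeness_weights(batch)
loss = weighted_nll(batch, weights)
uniform_loss = weighted_nll(batch, [0.5, 0.5])  # standard offline baseline
```

In a real training loop the weights would typically be treated as constants (no gradient flows through them) and recomputed as the student changes, so that the emphasis tracks the student's current distribution rather than its initial one.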

Key Points
  • DCOD corrects teacher-student distribution drift without costly online sampling, outperforming prior offline methods on math benchmarks.
  • Evaluated on GSM8K, MATH, MATH500, and harder competitions (AMC, AIME, OlympiadBench) with improved reasoning accuracy.
  • Method preserves instruction-following capabilities while producing more stable reasoning traces than standard offline distillation.

Why It Matters

DCOD enables cheaper, more reliable distillation of reasoning skills into small models for math and logic tasks.