Research & Papers

New SDPG method stabilizes LLM training with self-distillation

SDPG addresses sparse rewards by combining policy gradients with self-distillation...

Deep Dive

A new paper from Yifeng Liu, Shiyuan Zhang, Yifan Zhang, and Quanquan Gu introduces Self-Distilled Policy Gradient (SDPG), a reinforcement learning framework designed to tackle sparse reward signals in large language model fine-tuning. The core insight is that on-policy self-distillation, in which the model conditions on privileged context to supervise its own outputs, can serve as a dense supervisory signal. The authors formalize this as an auxiliary reverse KL divergence loss from student to teacher, computed over the full vocabulary rather than only the sampled tokens, which gives richer gradient updates than standard RL with sparse rewards.
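To make the full-vocabulary reverse KL term concrete, here is a minimal sketch for a single token position. It assumes the student is the policy without privileged context and the teacher is the same model with it, and it uses a toy four-token vocabulary. The function names and logits are illustrative only and are not taken from the authors' released code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position, summed over every vocabulary entry.

    "Reverse" here means the expectation is taken under the student
    distribution, so the student is penalized for putting mass where
    the teacher puts little.
    """
    log_p = log_softmax(student_logits)  # student log-probabilities
    log_q = log_softmax(teacher_logits)  # teacher log-probabilities
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Hypothetical logits over a 4-token vocabulary (assumed values).
student = [2.0, 0.5, 0.1, -1.0]   # policy without privileged context
teacher = [0.2, 2.5, 0.1, -1.0]   # same model, conditioned on privileged context

kl_value = reverse_kl(student, teacher)  # positive: the distributions differ
kl_same = reverse_kl(student, student)   # zero: identical distributions
```

Because the sum runs over the whole vocabulary, every position contributes a gradient signal even when the scalar reward for the full response carries little information.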

SDPG combines three key components: group-relative verifier advantages (GRVA) to normalize reward comparisons, exact full-vocabulary on-policy self-distillation to provide dense supervision, and reference-policy KL regularization to prevent catastrophic forgetting. The authors report improved training stability and higher final performance compared to both pure reinforcement learning with verifiable rewards (RLVR) and prior self-distillation baselines. They have released code to facilitate reproduction. The work offers a practical recipe for making RL-based LLM fine-tuning more sample-efficient and reliable.
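The way these three components could fit together can be sketched as follows. This is an assumption-laden illustration, not the paper's exact objective: the advantage normalization follows the common mean-and-standard-deviation form used for group-relative methods, and the helper names and the coefficients `distill_coef` and `ref_coef` are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize verifier rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

def policy_gradient_loss(advantages, seq_log_probs):
    """REINFORCE-style surrogate: minimize the negative advantage-weighted log-prob."""
    terms = [a * lp for a, lp in zip(advantages, seq_log_probs)]
    return -statistics.fmean(terms)

def sdpg_style_loss(pg_loss, distill_kl, ref_kl, distill_coef=0.1, ref_coef=0.01):
    """Weighted sum of the three terms; the coefficients are assumed values."""
    return pg_loss + distill_coef * distill_kl + ref_coef * ref_kl

# One group of four responses to the same prompt, scored 0/1 by a verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(rewards)

# A group where every response fails: all advantages collapse to zero,
# so the policy-gradient term alone gives no learning signal.
flat_advs = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

total = sdpg_style_loss(pg_loss=1.0, distill_kl=2.0, ref_kl=3.0)
```

The all-fail group shows why a dense auxiliary term matters: when every sampled response receives the same reward, the group-relative advantages vanish, while a self-distillation term of the kind described above can still provide a gradient.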

Key Points
  • SDPG uses on-policy self-distillation with a full-vocabulary reverse KL divergence loss to provide dense supervision.
  • Integrates group-relative verifier advantages with standard-deviation normalization for stable reward comparisons, plus reference-policy KL regularization.
  • Outperforms RLVR and self-distillation baselines in stability and final performance, with code open-sourced.

Why It Matters

A more stable way to fine-tune LLMs with RL, reducing reward engineering and improving sample efficiency.