Hugging Face adds On-Policy Distillation, used by Qwen 3.6 and DeepSeek-V4, to PapersWithCode
The post-training technique corrects specific mistakes in a rollout without regenerating it.
On-Policy Distillation (OPD) has become one of the hottest terms in AI research, and the Hugging Face open-source team has now added it to PapersWithCode. Niels, the contributor, explains that OPD is the key post-training technique behind models like Qwen 3.6 and 3.7, GLM-5.1, and DeepSeek-V4. The method addresses a core challenge in reinforcement learning: when a model makes a mistake during a rollout (e.g., calling a non-existent tool), the final reward signal is too noisy to pinpoint the exact error. OPD handles this by having a secondary model read the trajectory and insert hint tokens immediately before the point where the error occurred. With the hints in context, a single forward pass of the primary model assigns lower probabilities to the erroneous tokens. No new rollout has to be generated, so there is no additional decode cost. The original model is then trained to match these adjusted probabilities, effectively downweighting that specific mistake.
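The final training step can be illustrated with a toy sketch. It assumes a three-token vocabulary and made-up logits at the single position where the error occurred. The names `VOCAB`, `hinted_logits`, and `distill_step` are hypothetical, and the choice of KL(teacher || student) as the matching loss is an assumption, since the article does not say which divergence is used. In a real system both sets of logits would come from forward passes of the same model, with and without the hint tokens in context.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_step(student_logits, teacher_probs, lr=0.5):
    """One gradient-descent step on KL(teacher || student).

    The gradient of this loss with respect to the student logits
    is softmax(student_logits) - teacher_probs.
    """
    student_probs = softmax(student_logits)
    return [z - lr * (s - t)
            for z, s, t in zip(student_logits, student_probs, teacher_probs)]

# Hypothetical vocabulary at the position where the rollout went wrong.
VOCAB = ["call_real_tool", "call_fake_tool", "answer"]
ERROR_TOKEN = VOCAB.index("call_fake_tool")

# Made-up logits: without the hint, the model prefers the non-existent tool.
student_logits = [1.0, 2.0, 0.5]
# Made-up logits for the same position with hint tokens in context.
hinted_logits = [2.5, -1.0, 0.5]

# The hinted distribution is the fixed target; no new rollout is sampled.
teacher_probs = softmax(hinted_logits)

before = softmax(student_logits)[ERROR_TOKEN]
kl_before = kl_divergence(teacher_probs, softmax(student_logits))

for _ in range(50):
    student_logits = distill_step(student_logits, teacher_probs)

after = softmax(student_logits)[ERROR_TOKEN]
kl_after = kl_divergence(teacher_probs, softmax(student_logits))

print(f"P(error token) before: {before:.3f}, after: {after:.3f}")
print(f"KL(teacher || student) before: {kl_before:.3f}, after: {kl_after:.3f}")
```

The update here acts directly on a logit vector to keep the example small. An actual implementation would backpropagate the same per-token loss through the model's parameters over every position in the trajectory.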
Sasha Rush, formerly of Hugging Face and now at Cursor, recently created an excellent whiteboard explanation of OPD with Dwarkesh, which is now linked directly on the PapersWithCode method page. This resource makes the technique accessible to a wider audience of researchers and practitioners. OPD represents a shift toward more efficient post-training corrections, particularly useful for complex, multi-step tasks where errors are localized but rewards are sparse. By avoiding full rollout regeneration and localizing errors precisely, OPD speeds up training and improves model reliability, which is likely why several leading model families have adopted it. Niels invites the community to suggest other methods to add, signaling that PapersWithCode will continue to track emerging training innovations.
- On-Policy Distillation (OPD) is a post-training technique used by Qwen 3.6/3.7, GLM-5.1, and DeepSeek-V4.
- OPD injects hint tokens at error points in rollout trajectories to adjust probabilities without regenerating the rollout (no new decode).
- Sasha Rush (ex-Hugging Face, now at Cursor) provided an excellent whiteboard explanation with Dwarkesh, now linked on PapersWithCode.
Why It Matters
OPD corrects specific mistakes in a rollout without regenerating it, which speeds up post-training and improves model reliability.