Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation
New distillation technique lets smaller AI models outperform their larger teachers on complex reasoning tasks.
Researchers Minsang Kim and Seung Jun Baek have introduced a novel AI training framework called Token-Selective Dual Knowledge Distillation (TSD-KD) that significantly improves how reasoning capabilities are transferred from large, powerful models to smaller, more efficient ones. The core innovation addresses a key weakness in standard Knowledge Distillation (KD): forcing a small 'student' model to exactly mimic a large 'teacher' can overwhelm the student's limited capacity, especially on complex tasks that require Chain-of-Thought reasoning. TSD-KD takes a student-centric approach, distilling only the tokens that matter most for reasoning and letting the student explain concepts in its own words rather than through rigid imitation.
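To see the "exact mimicry" pressure concretely, here is what conventional token-level distillation looks like in code. This is a generic textbook formulation for contrast, not a detail from the paper: the forward-KL direction, temperature scaling, and tensor shapes are all standard assumptions.

```python
import torch
import torch.nn.functional as F

def standard_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Conventional token-level KD: the student must match the teacher's
    full next-token distribution at every single position.

    Shapes: (batch, seq_len, vocab_size). The forward-KL direction and
    temperature are standard textbook choices, assumed for illustration.
    """
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), summed over the vocabulary and averaged
    # over all tokens: every position carries full imitation pressure.
    kl = F.kl_div(s_logp, t_prob, reduction="none").sum(dim=-1)
    return kl.mean() * temperature ** 2
```

It is precisely this per-token, whole-distribution matching that TSD-KD relaxes.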
The framework combines two complementary techniques. First, 'indirect distillation' provides softer guidance: the student generates its own candidate responses, and the teacher simply re-ranks them as feedback, avoiding the pressure of exact distribution matching. Second, 'direct distillation' is applied selectively, matching token distributions only where the teacher is significantly more confident than the student. This dual approach, paired with entropy regularization that maintains the student's confidence, facilitates genuine self-improvement. The results are striking: TSD-KD achieved state-of-the-art performance across 10 challenging reasoning benchmarks, outperforming the best baseline by up to 54.4% in accuracy and the runner-up by 40.3%. Most notably, in four cases the smaller student model trained with TSD-KD actually outperformed its own larger teacher by margins of up to 20.3%, demonstrating the method's effectiveness at eliciting superior reasoning from more constrained architectures.
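A minimal sketch of how these two losses could fit together follows. Everything concrete in it is an illustrative assumption rather than the paper's exact recipe: confidence is measured as each model's probability on the realized token, `tau` is a hypothetical gap threshold, the sign of the entropy term is inferred from the "maintain confidence" description, and a Plackett-Luce-style listwise loss stands in for the teacher's re-ranking feedback.

```python
import torch
import torch.nn.functional as F

def selective_direct_kd(student_logits, teacher_logits, target_ids,
                        tau=0.2, ent_coef=0.01):
    """Token-selective direct distillation (illustrative sketch).

    Distills only at positions where the teacher is markedly more
    confident than the student; elsewhere the student keeps its own
    phrasing. `tau` and the confidence measure (probability assigned
    to the realized token) are hypothetical choices, not the paper's.
    Shapes: logits (batch, seq, vocab); target_ids (batch, seq).
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)

    # Per-token confidence: probability of the token actually generated.
    idx = target_ids.unsqueeze(-1)
    s_conf = s_logp.gather(-1, idx).squeeze(-1).exp()
    t_conf = t_logp.gather(-1, idx).squeeze(-1).exp()

    # Keep only tokens where the teacher's confidence clearly exceeds
    # the student's, i.e. the positions most important for reasoning.
    mask = ((t_conf - s_conf) > tau).float()

    # KL(teacher || student) applied at the selected positions only.
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)
    distill = (kl * mask).sum() / mask.sum().clamp(min=1.0)

    # Entropy penalty keeps the student's distribution sharp, which
    # maintains its confidence (the sign of this term is an assumption).
    entropy = -(s_logp.exp() * s_logp).sum(dim=-1).mean()
    return distill + ent_coef * entropy

def indirect_ranking_loss(seq_logprobs, teacher_order):
    """Indirect distillation via re-ranking (illustrative sketch).

    `seq_logprobs` holds the student's log-probability of each of its
    own sampled candidate responses; `teacher_order` is the teacher's
    preference order (best first). A Plackett-Luce listwise loss nudges
    the student to rank its candidates the way the teacher does,
    without matching any distribution exactly.
    """
    ordered = seq_logprobs[teacher_order]
    loss = seq_logprobs.new_zeros(())
    for i in range(ordered.shape[0] - 1):
        loss = loss - (ordered[i] - torch.logsumexp(ordered[i:], dim=0))
    return loss
```

The design intuition: ranking feedback applies everywhere but only loosely, while hard distribution matching applies only where the teacher is demonstrably more sure, leaving the student free to phrase the rest of the reasoning in its own words.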
This work, accepted at the prestigious ICLR 2026 conference, represents a major step toward democratizing advanced AI reasoning. By making it possible to build smaller, cheaper models that rival or even exceed far larger ones on specific reasoning tasks, TSD-KD lowers the barrier to deploying sophisticated AI in cost-sensitive or latency-critical environments, from edge devices to large-scale applications.
- TSD-KD improves reasoning transfer to small models, beating the best baseline by up to 54.4% in accuracy across 10 benchmarks.
- Uses a dual approach: indirect ranking feedback and selective direct token distillation.
- Enables student models to outperform their own teacher models by up to 20.3%, observed in four cases.
Why It Matters
Enables cheaper, smaller AI models to achieve superior reasoning, reducing deployment costs and computational barriers for advanced AI applications.