Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO
A three-stage curriculum learning approach helps compact models like Qwen2.5-3B learn complex reasoning from larger teacher models.
A research team led by Bowen Yu has introduced a method for efficiently transferring complex reasoning capabilities from large language models to compact student models. Their paper 'Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO' addresses a fundamental challenge in distillation: teacher models often produce verbose rationales that smaller student models struggle to reproduce faithfully.
The three-stage curriculum learning framework begins with structural understanding through masked shuffled reconstruction, where the model learns to recognize reasoning patterns. Next, Group Relative Policy Optimization (GRPO) is applied to masked completion tasks, allowing the model to autonomously discover the optimal balance between accuracy and brevity. Finally, the system identifies persistent failure cases and guides the student to internalize teacher knowledge through targeted rewriting, again optimized with GRPO.
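The two core mechanics above, masked reconstruction of reasoning steps and GRPO's group-relative reward normalization, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the mask ratio, the brevity penalty, and the function names are all assumptions introduced here for clarity. GRPO's defining trait, shown in `grpo_advantages`, is that each sampled completion's reward is normalized against the mean and standard deviation of its own sample group, removing the need for a learned value network.

```python
import random


def mask_and_shuffle(steps, mask_token="<mask>", mask_ratio=0.3, seed=0):
    """Stage-1-style data prep (hypothetical sketch): mask a fraction of the
    teacher's reasoning steps and shuffle the result, so the student must
    recover the original order and content of the chain of thought."""
    rng = random.Random(seed)
    masked = [mask_token if rng.random() < mask_ratio else s for s in steps]
    rng.shuffle(masked)
    return masked  # model input; the original `steps` list is the target


def reward(correct, length, max_length=512, brevity_weight=0.2):
    """Hypothetical accuracy-plus-brevity reward: 1.0 for a correct final
    answer, minus a penalty proportional to output length. The 0.2 weight
    is an illustrative choice, not a value from the paper."""
    return float(correct) - brevity_weight * min(length / max_length, 1.0)


def grpo_advantages(rewards, eps=1e-8):
    """GRPO advantage estimate: normalize each completion's reward by the
    mean/std of its sampled group (no critic network needed)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Under this reward, a correct but long completion can score below a correct short one, which is exactly the pressure that lets the student discover its own accuracy/brevity trade-off during stage two.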
Experimental results demonstrate significant improvements: the Qwen2.5-3B-Base model achieved an 11.29% accuracy boost on the GSM8K mathematical reasoning benchmark while reducing output length by 27.4%. This outperforms both standard instruction-tuned variants and previous distillation methods. The approach retains the interpretability of Chain-of-Thought reasoning while making it practical for deployment on resource-constrained devices.
This research matters because it enables more efficient deployment of reasoning-capable AI systems. As organizations seek to run sophisticated AI locally on edge devices or with limited computational budgets, methods that preserve complex reasoning while reducing model size and output verbosity become increasingly valuable for real-world applications.
- Three-stage curriculum learning framework improves Chain-of-Thought distillation efficiency by teaching structural understanding first
- Qwen2.5-3B-Base model achieved 11.29% higher accuracy on GSM8K while reducing output length by 27.4%
- Uses Group Relative Policy Optimization (GRPO) to help models autonomously balance accuracy and brevity in reasoning
Why It Matters
Enables deployment of sophisticated reasoning AI on resource-constrained devices while maintaining interpretability and reducing computational costs.