ReAD uses reinforcement learning to guide LLM capability distillation
Uncertainty-aware bandit dynamically allocates token budget to task-relevant abilities
Traditional capability distillation compresses large language models into smaller ones by focusing on selected abilities, but it treats each capability independently and ignores how improving one affects the others. Under a fixed token budget, the researchers observed two patterns: distilling one capability systematically transfers to other capabilities, and adding more budget often yields limited task-relevant gains while degrading other abilities. The result is wasted training tokens and harmful spillover.
To address this, the team introduces ReAD (Reinforcement-Guided Capability Distillation). The framework first infers which capabilities are essential for the downstream task, generates targeted supervision on the fly, then uses an uncertainty-aware contextual bandit to allocate distillation budget adaptively based on expected utility. Extensive experiments show ReAD improves task utility under the same token budget while reducing harmful spillover compared to strong baselines. Code is publicly available.
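The allocation step can be sketched in a few lines. The summary does not say which bandit algorithm ReAD uses, so the sketch below substitutes a standard LinUCB-style contextual bandit as a stand-in: each capability is an arm, and each round one fixed-size chunk of tokens goes to the arm with the highest estimated gain plus uncertainty bonus. All names (`LinUCBArm`, `allocate_budget`, `context_fn`, `gain_fn`), the capability labels, and the simulated diminishing-returns gains are assumptions made for illustration, not taken from the ReAD code.

```python
import math


class LinUCBArm:
    """One capability, modeled as an arm of a LinUCB-style bandit (illustrative)."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha  # weight of the uncertainty bonus
        # Inverse of the ridge-regularized design matrix; starts as the identity.
        self.a_inv = [[float(i == j) for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * context

    def _a_inv_times(self, vec):
        return [sum(a * v for a, v in zip(row, vec)) for row in self.a_inv]

    def score(self, x):
        """Estimated gain plus an uncertainty bonus (upper confidence bound)."""
        theta = self._a_inv_times(self.b)
        mean = sum(t * xi for t, xi in zip(theta, x))
        variance = sum(xi * v for xi, v in zip(x, self._a_inv_times(x)))
        return mean + self.alpha * math.sqrt(max(variance, 0.0))

    def update(self, x, reward):
        """Sherman-Morrison rank-one update of the inverse design matrix."""
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        dim = len(x)
        for i in range(dim):
            for j in range(dim):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
            self.b[i] += reward * x[i]


def allocate_budget(capabilities, total_tokens, chunk_tokens,
                    context_fn, gain_fn, alpha=1.0):
    """Spend the token budget one chunk at a time on the highest-scoring arm."""
    spent = {c: 0 for c in capabilities}
    dim = len(context_fn(capabilities[0], spent))
    arms = {c: LinUCBArm(dim, alpha) for c in capabilities}
    remaining = total_tokens
    while remaining >= chunk_tokens:
        contexts = {c: context_fn(c, spent) for c in capabilities}
        chosen = max(capabilities, key=lambda c: arms[c].score(contexts[c]))
        # Assumption: in a real pipeline this would be the measured change in
        # task utility after distilling on one more chunk for `chosen`.
        reward = gain_fn(chosen, spent)
        arms[chosen].update(contexts[chosen], reward)
        spent[chosen] += chunk_tokens
        remaining -= chunk_tokens
    return spent


# Hypothetical setup: three capabilities with simulated diminishing returns.
TOTAL_TOKENS = 10_000
CHUNK_TOKENS = 500
BASE_GAIN = {"math": 0.5, "coding": 0.3, "summarization": 0.05}


def context_fn(capability, spent):
    # Context: a bias term plus the fraction of budget already spent on this arm.
    return [1.0, spent[capability] / TOTAL_TOKENS]


def gain_fn(capability, spent):
    # Simulated utility gain that shrinks as more chunks go to the same arm.
    chunks_so_far = spent[capability] / CHUNK_TOKENS
    return BASE_GAIN[capability] / (1.0 + chunks_so_far)


allocation = allocate_budget(list(BASE_GAIN), TOTAL_TOKENS, CHUNK_TOKENS,
                             context_fn, gain_fn)
print(allocation)
```

The uncertainty bonus is what keeps every capability in play early on: an arm that has never received tokens scores highly until it has been tried. A reward signal that also penalizes degradation in other capabilities would be one way to account for spillover, though this sketch does not model it.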
- The researchers identify two consistent patterns in capability distillation: systematic cross-capability transfer, and diminishing task-relevant returns from extra budget, often with degradation of other abilities.
- The framework uses an uncertainty-aware contextual bandit to adaptively allocate token budget based on expected utility gains for each capability.
- Experiments demonstrate improved downstream task utility while reducing harmful spillover and wasted distillation effort versus strong baselines.
Why It Matters
Smarter token budgeting during LLM compression can cut wasted training tokens while preserving task performance, enabling more efficient model deployment.