TED: Training-Free Experience Distillation for Multimodal Reasoning
A new method transfers knowledge between AI models without expensive retraining, cutting costs by over 5x using in-context prompts.
A research team including Shuozhi Yuan and Jinqing Wang has introduced TED (Training-Free Experience Distillation), a framework that fundamentally rethinks how knowledge is transferred between AI models. Traditional knowledge distillation requires extensive parameter updates and large datasets, making it resource-intensive. TED bypasses this by shifting the update target from model parameters to contextual experiences injected directly into the student model's prompt. For each input, the student generates multiple reasoning paths while a teacher model produces its own solution. The teacher then compares the student's trajectories against its own reasoning and the ground-truth answer to extract generalized, effective reasoning patterns. These distilled 'experiences' are continuously refined and updated over time, creating a dynamic knowledge base.
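The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the stub student/teacher, and the experience format are all assumptions made for clarity.

```python
# Hypothetical sketch of TED's core loop. The student and teacher are
# stubs; in practice they would be calls to multimodal LLMs. Only the
# experience list (the student's context) is ever updated -- no
# model parameters change.

def student_generate(question, experiences, n_paths=3):
    """Stub student: produce n candidate reasoning paths, conditioned
    on the distilled experiences injected into its prompt."""
    prompt = "\n".join(experiences) + "\n" + question
    return [f"path-{i} for {question}" for i in range(n_paths)]

def teacher_extract_experience(question, student_paths, teacher_solution, answer):
    """Stub teacher: compare the student's trajectories with the
    teacher's own solution and the ground-truth answer, and return a
    generalized reasoning pattern ('experience')."""
    return (f"For problems like '{question}': verify each intermediate "
            f"step against the expected answer ({answer}).")

def ted_step(question, answer, experiences):
    """One TED iteration: generate, compare, distill, update context."""
    paths = student_generate(question, experiences)
    teacher_solution = f"teacher solution for {question}"
    new_exp = teacher_extract_experience(question, paths,
                                         teacher_solution, answer)
    experiences.append(new_exp)
    return experiences

experiences = []
for q, a in [("Q1", "A1"), ("Q2", "A2")]:
    experiences = ted_step(q, a, experiences)
print(len(experiences))  # the experience pool grows with each sample
```

The key design point this sketch captures is that "learning" happens entirely in the context window: each pass leaves the student's weights untouched and only extends the experience pool.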
A critical innovation in TED is its experience compression mechanism, which tackles the inherent problem of unbounded growth and noise accumulation in context-based systems. The framework tracks usage statistics and selectively merges, rewrites, or removes low-utility experiences to maintain efficiency. In experiments on multimodal reasoning benchmarks MathVision and VisualPuzzles, TED demonstrated significant gains. It raised the performance of the Qwen3-VL-8B model from 0.627 to 0.702 on MathVision and from 0.517 to 0.561 on VisualPuzzles using only 100 training samples. Under this low-data, no-parameter-update setting, TED achieved performance competitive with fully trained parameter-based distillation while reducing the associated training cost by over 5x. This breakthrough suggests that meaningful knowledge transfer can be effectively achieved through contextual prompting rather than exhaustive retraining.
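The compression step keeps the experience pool bounded. The sketch below is an illustrative assumption about how usage statistics might drive pruning; the utility score, the `(text, uses, successes)` record format, and the size threshold are not from the paper.

```python
# Hypothetical experience-compression sketch: score each experience by
# its track record and keep only the top entries once the pool exceeds
# a size budget. TED additionally merges and rewrites entries; this
# sketch shows only the removal of low-utility experiences.

def compress_experiences(experiences, max_size=4):
    """Each entry is (text, use_count, success_count). If the pool is
    within budget, leave it alone; otherwise rank by utility and drop
    the lowest-scoring entries."""
    if len(experiences) <= max_size:
        return experiences

    def utility(entry):
        # Success rate weighted by how often the experience fired,
        # so a reliable, frequently-used pattern outranks a rarely
        # triggered one. This simplifies to the raw success count.
        text, uses, successes = entry
        return (successes / uses) * uses if uses else 0.0

    ranked = sorted(experiences, key=utility, reverse=True)
    return ranked[:max_size]

pool = [
    ("check units before combining quantities", 10, 9),
    ("draw a diagram first", 4, 1),
    ("verify arithmetic in the final step", 8, 8),
    ("guess when unsure", 6, 0),
    ("re-read the question for constraints", 5, 4),
]
kept = compress_experiences(pool, max_size=3)
print([text for text, _, _ in kept])
```

Running this keeps the three experiences with the best usage-weighted success record and discards the noisy ones, which is the behavior the compression mechanism is designed to enforce as the pool grows.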
- Shifts knowledge transfer from parameter updates to in-context prompts, eliminating traditional training cycles.
- Boosts Qwen3-VL-8B's MathVision score by 12% relative (from 0.627 to 0.702) with just 100 samples.
- Uses compression to manage prompt growth, reducing training costs by over 5x compared to full distillation.
Why It Matters
Enables rapid, cost-effective improvement of existing AI models without the massive compute typically required for fine-tuning.