TED: Training-Free Experience Distillation for Multimodal Reasoning
A new method transfers knowledge between AI models without expensive retraining, cutting costs by over 5x using in-context prompts.
A research team including Shuozhi Yuan and Jinqing Wang has introduced TED (Training-Free Experience Distillation), a framework that fundamentally rethinks how knowledge is transferred between AI models. Traditional knowledge distillation requires extensive parameter updates and large datasets, making it resource-intensive. TED bypasses this by shifting the update target from model parameters to contextual experiences injected directly into the student model's prompt. For each input, the student generates multiple reasoning paths while a teacher model produces its own solution. The teacher then compares the student's trajectories against its own reasoning and the ground-truth answer to extract generalized, effective reasoning patterns. These distilled 'experiences' are continuously refined and updated over time, creating a dynamic knowledge base.
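The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the stub student/teacher, and the experience format are all assumptions made for clarity.

```python
# Hypothetical sketch of TED's core loop. The student and teacher are
# stubs; in practice they would be calls to multimodal LLMs. Only the
# experience list (the student's context) is ever updated -- no
# model parameters change.

def student_generate(question, experiences, n_paths=3):
    """Stub student: produce n candidate reasoning paths, conditioned
    on the distilled experiences injected into its prompt."""
    prompt = "\n".join(experiences) + "\n" + question
    return [f"path-{i} for {question}" for i in range(n_paths)]

def teacher_extract_experience(question, student_paths, teacher_solution, answer):
    """Stub teacher: compare the student's trajectories with the
    teacher's own solution and the ground-truth answer, and return a
    generalized reasoning pattern ('experience')."""
    return (f"For problems like '{question}': verify each intermediate "
            f"step against the expected answer ({answer}).")

def ted_step(question, answer, experiences):
    """One TED iteration: generate, compare, distill, update context."""
    paths = student_generate(question, experiences)
    teacher_solution = f"teacher solution for {question}"
    new_exp = teacher_extract_experience(question, paths,
                                         teacher_solution, answer)
    experiences.append(new_exp)
    return experiences

experiences = []
for q, a in [("Q1", "A1"), ("Q2", "A2")]:
    experiences = ted_step(q, a, experiences)
print(len(experiences))  # the experience pool grows with each sample
```

The key design point this sketch captures is that "learning" happens entirely in the context window: each pass leaves the student's weights untouched and only extends the experience pool.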
A critical innovation in TED is its experience compression mechanism, which tackles the inherent problem of unbounded growth and noise accumulation in context-based systems. The framework tracks usage statistics and selectively merges, rewrites, or removes low-utility experiences to maintain efficiency. In experiments on multimodal reasoning benchmarks MathVision and VisualPuzzles, TED demonstrated significant gains. It raised the performance of the Qwen3-VL-8B model from 0.627 to 0.702 on MathVision and from 0.517 to 0.561 on VisualPuzzles using only 100 training samples. Under this low-data, no-parameter-update setting, TED achieved performance competitive with fully trained parameter-based distillation while reducing the associated training cost by over 5x. This breakthrough suggests that meaningful knowledge transfer can be effectively achieved through contextual prompting rather than exhaustive retraining.
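The compression step keeps the experience pool bounded. The sketch below is an illustrative assumption about how usage statistics might drive pruning; the utility score, the `(text, uses, successes)` record format, and the size threshold are not from the paper.

```python
# Hypothetical experience-compression sketch: score each experience by
# its track record and keep only the top entries once the pool exceeds
# a size budget. TED additionally merges and rewrites entries; this
# sketch shows only the removal of low-utility experiences.

def compress_experiences(experiences, max_size=4):
    """Each entry is (text, use_count, success_count). If the pool is
    within budget, leave it alone; otherwise rank by utility and drop
    the lowest-scoring entries."""
    if len(experiences) <= max_size:
        return experiences

    def utility(entry):
        # Success rate weighted by how often the experience fired,
        # so a reliable, frequently-used pattern outranks a rarely
        # triggered one. This simplifies to the raw success count.
        text, uses, successes = entry
        return (successes / uses) * uses if uses else 0.0

    ranked = sorted(experiences, key=utility, reverse=True)
    return ranked[:max_size]

pool = [
    ("check units before combining quantities", 10, 9),
    ("draw a diagram first", 4, 1),
    ("verify arithmetic in the final step", 8, 8),
    ("guess when unsure", 6, 0),
    ("re-read the question for constraints", 5, 4),
]
kept = compress_experiences(pool, max_size=3)
print([text for text, _, _ in kept])
```

Running this keeps the three experiences with the best usage-weighted success record and discards the noisy ones, which is the behavior the compression mechanism is designed to enforce as the pool grows.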
- Shifts knowledge transfer from parameter updates to in-context prompts, eliminating traditional training cycles.
- Boosts Qwen3-VL-8B's MathVision score by 12% relative (from 0.627 to 0.702) with just 100 samples.
- Uses compression to manage prompt growth, reducing training costs by over 5x compared to full distillation.
Why It Matters
Enables rapid, cost-effective improvement of existing AI models without the massive compute typically required for fine-tuning.