Fast-Slow Training (FST) makes LLM learning up to 3x more sample-efficient while reducing forgetting
Dual-speed weights let LLMs learn with up to 3x fewer samples and stay up to 70% closer to the base model
Large language models face a fundamental trade-off: updating model parameters (e.g., via RL) improves task performance but causes catastrophic forgetting and loss of plasticity, while in-context learning is cheap but underperforms. A new paper introduces Fast-Slow Training (FST), which treats model parameters as slow weights that retain general reasoning and optimized context (prompts) as fast weights that absorb task-specific information via textual feedback. This dual-time-scale approach mimics human learning (System 1 vs System 2) and allows models to adapt rapidly without overwriting core knowledge.
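The split between slow and fast weights can be pictured with a toy sketch. This is not the paper's algorithm, which this summary does not spell out. The class `FastSlowLearner`, the `train` function, the update schedule (`slow_every`), the step size, and the placeholder gradient are all hypothetical names and values chosen for illustration. A single float stands in for the model parameters and a list of strings stands in for the optimized prompt.

```python
from dataclasses import dataclass, field


@dataclass
class FastSlowLearner:
    """Toy stand-in for a model with slow weights and a fast, textual context."""

    slow_weight: float = 0.0                           # stands in for model parameters
    fast_context: list = field(default_factory=list)   # stands in for the optimized prompt
    slow_lr: float = 0.1                               # hypothetical step size

    def update_fast(self, feedback: str) -> None:
        # Fast update: store textual feedback in the context; parameters are untouched.
        if feedback not in self.fast_context:
            self.fast_context.append(feedback)

    def update_slow(self, gradient: float) -> None:
        # Slow update: a small gradient-style step on the parameters.
        self.slow_weight -= self.slow_lr * gradient

    def build_prompt(self, task: str) -> str:
        # The fast weights reach the model only through the prompt text.
        notes = "\n".join(f"- {note}" for note in self.fast_context)
        return f"Notes from earlier attempts:\n{notes}\n\nTask: {task}"


def train(learner, feedback_stream, slow_every=5, placeholder_gradient=1.0):
    """Apply a fast update every step and a slow update every `slow_every` steps."""
    for step, feedback in enumerate(feedback_stream, start=1):
        learner.update_fast(feedback)
        if step % slow_every == 0:
            learner.update_slow(placeholder_gradient)
    return learner


learner = train(FastSlowLearner(), [f"hint {i}" for i in range(10)])
prompt = learner.build_prompt("solve the puzzle")
```

In this sketch the context absorbs every piece of feedback immediately, while the parameter moves only twice in ten steps, which is the intuition behind keeping the slow weights close to their starting point.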
In experiments, FST proved up to 3x more sample-efficient than pure RL across reasoning tasks, consistently reaching higher accuracy asymptotes. Critically, FST-trained models stayed up to 70% closer to the base LLM (lower KL divergence), drastically reducing catastrophic forgetting. This preserved plasticity: after training on one task, FST models adapted to a second task much more effectively than RL-trained models. In continual learning scenarios where task domains change on the fly, FST continued acquiring new tasks while parameter-only RL stalled entirely. The method offers a practical path to LLMs that learn continually without forgetting.
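To make the "70% closer" figure concrete, here is a minimal sketch of how a relative reduction in KL divergence could be computed. The three distributions are made-up toy numbers over a three-token vocabulary, not results from the paper, and the direction of the divergence (trained model relative to base) is an assumption, since the summary does not state it.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Illustrative next-token distributions (hypothetical values, not measured data).
base = [0.5, 0.3, 0.2]
rl_trained = [0.9, 0.05, 0.05]    # drifted far from the base model
fst_trained = [0.6, 0.25, 0.15]   # stayed closer to the base model

kl_rl = kl_divergence(rl_trained, base)
kl_fst = kl_divergence(fst_trained, base)

# Fraction by which the second model's drift is smaller than the first's.
relative_reduction = 1.0 - kl_fst / kl_rl
```

A `relative_reduction` of 0.7 would correspond to a model that sits 70% closer to the base LLM than its RL-trained counterpart under this measure.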
- FST is up to 3x more sample-efficient than RL on reasoning tasks while achieving higher final performance
- FST-trained models show up to 70% lower KL divergence from the base LLM than RL-trained models, indicating far less catastrophic forgetting
- FST preserves plasticity and enables continual learning across changing tasks, where RL stalls completely
Why It Matters
FST lets LLMs keep learning new tasks while staying close to the base model, so far less prior knowledge is overwritten. That property is essential for real-world adaptive AI.