Fast-Slow Training (FST) enables 3x faster LLM learning while preventing forgetting

r/MachineLearning May 13, 2026

⚡ Dual-speed weights let LLMs learn up to 3x faster while drifting up to 70% less from the base model

Deep Dive

Large language models face a fundamental trade-off. Updating model parameters (e.g., via RL) improves task performance but causes catastrophic forgetting and loss of plasticity, while in-context learning is cheap but underperforms. A new paper introduces Fast-Slow Training (FST), which splits learning across two time scales: the model parameters act as slow weights that retain general reasoning ability, and an optimized context (the prompt) acts as fast weights that absorb task-specific information from textual feedback. The summary likens this to human learning (System 1 vs System 2): the model adapts rapidly through its context without overwriting core knowledge in its parameters.
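The two time scales can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the paper's algorithm: the class `FastSlowLearner`, its fields, and the update schedule (context revised every step, parameters nudged only every few steps with a small learning rate) are all hypothetical names and choices made for illustration. A list of floats stands in for model parameters and a list of strings stands in for the optimized prompt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FastSlowLearner:
    """Toy stand-in for a dual-time-scale learner (hypothetical, illustrative only)."""
    slow_weights: List[float]                              # stands in for model parameters
    fast_context: List[str] = field(default_factory=list)  # stands in for the optimized prompt
    slow_lr: float = 0.01     # assumed small step size for the slow weights
    slow_every: int = 4       # assumed: slow weights change once per 4 steps
    max_notes: int = 3        # assumed cap on how much feedback the prompt keeps
    step_count: int = 0

    def fast_update(self, feedback: str) -> None:
        # Fast path: fold textual feedback into the context, keeping only recent notes.
        self.fast_context.append(feedback)
        self.fast_context = self.fast_context[-self.max_notes:]

    def slow_update(self, grad: List[float]) -> None:
        # Slow path: one small gradient-style step on the parameters.
        self.slow_weights = [w - self.slow_lr * g
                             for w, g in zip(self.slow_weights, grad)]

    def step(self, feedback: str, grad: List[float]) -> None:
        self.step_count += 1
        self.fast_update(feedback)                 # happens on every step
        if self.step_count % self.slow_every == 0:
            self.slow_update(grad)                 # happens only occasionally


learner = FastSlowLearner(slow_weights=[1.0, -1.0])
for i in range(8):
    learner.step(feedback=f"note {i}", grad=[1.0, 1.0])
```

After eight steps in this toy setup, the context has been rewritten eight times while the parameters have moved only twice, which is the intuition behind keeping task-specific detail out of the weights.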

In experiments, FST was up to 3x more sample-efficient than pure RL across reasoning tasks and consistently plateaued at higher accuracy. FST-trained models also stayed closer to the base LLM, with up to 70% lower KL divergence than RL-trained models, which indicates less catastrophic forgetting. Plasticity was preserved as well: after training on one task, FST models adapted to a second task much more effectively than RL-trained models. In continual learning scenarios where the task domain changes on the fly, FST kept acquiring new tasks while parameter-only RL stalled entirely. The method offers a practical path to LLMs that learn continually without forgetting.
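To make the KL comparison concrete, here is a small sketch of how drift from a base model could be measured on a single next-token distribution. The direction KL(trained || base) is an assumption (the summary does not say which direction the paper uses), the helper names are hypothetical, and the three probability vectors are made-up illustrative numbers, not results from the paper.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def relative_reduction(kl_baseline: float, kl_method: float) -> float:
    """Fraction by which kl_method is lower than kl_baseline (0.7 would mean 70% lower)."""
    return 1.0 - kl_method / kl_baseline


# Illustrative next-token distributions over a 3-token vocabulary (not real data).
base = [0.5, 0.3, 0.2]         # base LLM
rl_tuned = [0.8, 0.1, 0.1]     # hypothetical model that drifted far from the base
fst_tuned = [0.6, 0.25, 0.15]  # hypothetical model that stayed closer to the base

kl_rl = kl_divergence(rl_tuned, base)
kl_fst = kl_divergence(fst_tuned, base)
reduction = relative_reduction(kl_rl, kl_fst)
```

In practice such a measurement would be averaged over many prompts and token positions; a single distribution is used here only to keep the arithmetic visible.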

Key Points

FST is up to 3x more sample-efficient than RL on reasoning tasks while achieving higher final performance
FST-trained models show up to 70% lower KL divergence from the base LLM than RL-trained models, indicating less catastrophic forgetting
FST preserves plasticity and enables continual learning across changing tasks, where RL stalls completely

Why It Matters

FST enables LLMs to continually learn new tasks without erasing old knowledge, essential for real-world adaptive AI.
