Learning to Forget: Continual Learning with Adaptive Weight Decay
Neural networks that forget strategically—per parameter, on the fly.
Continual learning agents face a fundamental trade-off: they must absorb new information without catastrophically forgetting old knowledge. Traditional weight decay applies a single fixed rate, uniform across parameters and constant over time, which is suboptimal when some weights encode stable knowledge while others track rapidly changing targets. In a new preprint titled "Learning to Forget: Continual Learning with Adaptive Weight Decay," a team led by Jürgen Schmidhuber at IDSIA proposes FADE (Forgetting through Adaptive Decay), which adapts per-parameter weight decay rates online via approximate meta-gradient descent.
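To make the contrast concrete, here is a minimal NumPy sketch of a gradient step with a single fixed decay rate versus a per-parameter decay vector. The function names and hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Fixed scalar decay: one rate `lam` shrinks every weight identically.
def sgd_step_fixed(w, grad, alpha=0.01, lam=1e-3):
    return w - alpha * (grad + lam * w)

# Per-parameter decay: `lams` holds one rate per weight, so stable weights
# can keep a near-zero rate while volatile weights decay (forget) faster.
def sgd_step_per_param(w, grad, lams, alpha=0.01):
    return w - alpha * (grad + lams * w)
```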
FADE treats forgetting as a learnable process. The authors derive the method for the online linear setting and then apply it to the final layer of neural networks. Empirically, FADE automatically discovers distinct decay rates for different parameters, complements step-size adaptation, and consistently outperforms fixed weight decay across online tracking and streaming classification benchmarks. This work suggests that strategic, adaptive forgetting—rather than uniform decay—can significantly improve the efficiency and accuracy of continual learning systems.
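As a rough illustration of how such per-parameter rates could be adapted online, the sketch below uses an IDBD-style approximate meta-gradient for the online linear (squared-error) setting: each rate is parameterized as lambda_i = exp(beta_i), a trace h_i approximates the sensitivity of w_i to beta_i, and beta_i is nudged to reduce the next prediction error. The class name, the exponential parameterization, and all hyperparameters are assumptions for illustration; the preprint's exact FADE update may differ.

```python
import numpy as np

class AdaptiveDecayLinearLearner:
    """Online linear regression with per-parameter decay rates adapted by an
    IDBD-style approximate meta-gradient (an illustrative sketch, not the
    paper's exact FADE update)."""

    def __init__(self, n_features, alpha=0.05, theta=0.01, init_log_decay=-6.0):
        self.alpha = alpha                                # step size for the weights
        self.theta = theta                                # meta step size for the decay rates
        self.w = np.zeros(n_features)                     # weights
        self.beta = np.full(n_features, init_log_decay)   # lambda_i = exp(beta_i) > 0
        self.h = np.zeros(n_features)                     # trace of d w_i / d beta_i

    def predict(self, x):
        return self.w @ x

    def update(self, x, y):
        delta = y - self.predict(x)    # prediction error on the current example
        lam = np.exp(self.beta)        # current per-parameter decay rates

        # Meta-gradient step: dL/dbeta_i ~= -delta * x_i * h_i, so descend with
        # beta_i <- beta_i + theta * delta * x_i * h_i.
        self.beta += self.theta * delta * x * self.h

        # Weight update with per-parameter decay:
        # w_i <- (1 - alpha * lambda_i) * w_i + alpha * delta * x_i
        w_old = self.w
        self.w = (1.0 - self.alpha * lam) * w_old + self.alpha * delta * x

        # Trace update (diagonal approximation that ignores cross-parameter terms):
        # h_i <- (1 - alpha*lambda_i - alpha*x_i^2) * h_i - alpha * lambda_i * w_i
        self.h = (1.0 - self.alpha * lam - self.alpha * x * x) * self.h \
                 - self.alpha * lam * w_old
        return delta
```

Parameterizing the rates in log space keeps each lambda_i positive and lets the meta-gradient move it across orders of magnitude, so weights encoding stable knowledge can settle near zero decay while weights tracking drifting targets forget quickly.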
- FADE adapts per-parameter weight decay rates in real time using approximate meta-gradient descent.
- Unlike fixed scalar decay, it allows stable weights to persist while volatile weights forget faster.
- Outperforms uniform weight decay on online tracking and streaming classification tasks.
Why It Matters
Adaptive forgetting could make AI agents far more efficient at lifelong learning without catastrophic interference.