CP-MoE framework tackles catastrophic forgetting with transient expert routing
A new MoE method achieves state-of-the-art results on SuperNI and outperforms strong MoE baselines on VQA v2 while reducing parameter interference.
A team of researchers from UNSW Sydney (Yang Liu, Toan Nguyen, Flora D. Salim) has introduced CP-MoE (Consistency-Preserving Mixture-of-Experts), a framework for continual learning that tackles catastrophic forgetting in large language models (LLMs) and vision-language models (VLMs). Existing MoE continual learning methods built on LoRA face a fundamental trade-off: they either isolate experts too aggressively, limiting knowledge transfer, or allow task-specific updates to overwrite important parameters, causing severe forgetting. CP-MoE addresses this trade-off with two mechanisms. The first is a consistency-preserving routing bias, which uses a transient expert to estimate representation similarity with the stable experts and steers routing toward compatible selections. The second is a transient expert-guided regularization mechanism that selectively protects important historical parameters during merging.
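To make the routing idea concrete, here is a minimal sketch of a similarity-based routing bias. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not give the formula, so the use of cosine similarity, the additive form of the bias, and the names `biased_top_k` and `bias_scale` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def biased_top_k(router_logits, transient_out, stable_outs, bias_scale=1.0, k=2):
    """Pick k stable experts after adding a similarity bias to the router logits.

    router_logits: one score per stable expert (from a hypothetical router).
    transient_out: the transient expert's output for the current token.
    stable_outs:   one output vector per stable expert for the same token.

    Assumption: the bias is bias_scale * cosine(transient, stable), added to
    each logit. The actual CP-MoE bias may be defined differently.
    """
    biased = [
        logit + bias_scale * cosine(transient_out, out)
        for logit, out in zip(router_logits, stable_outs)
    ]
    # Rank experts by biased score, highest first, and keep the top k indices.
    ranked = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)
    return ranked[:k], biased


# Toy example: expert 0 has the highest raw logit, but its representation
# points away from the transient expert, so the bias demotes it.
logits = [0.5, 0.4, 0.45]
transient = [1.0, 0.0]
stable = [[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

selected, scores = biased_top_k(logits, transient, stable)
unbiased, _ = biased_top_k(logits, transient, stable, bias_scale=0.0)
```

In this toy case the unbiased router would pick experts 0 and 2, while the biased router picks experts 1 and 2, the two whose outputs are most compatible with the transient expert.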
Validated on both unimodal and multimodal benchmarks, CP-MoE achieves state-of-the-art results on the SuperNI benchmark (spanning diverse sequential language tasks) and demonstrates stronger zero-shot transfer to unseen tasks. On the VQA v2 dataset, CP-MoE scales effectively to multimodal visual reasoning, consistently reducing forgetting and outperforming strong MoE baselines. By preserving cross-task knowledge transfer while reducing parameter interference, CP-MoE offers a practical path for deploying continually learning MoE models in real-world applications where models must adapt to new tasks without forgetting previous ones.
- CP-MoE introduces a transient expert that captures early task-specific updates and guides integration into stable experts, reducing parameter interference.
- Achieves state-of-the-art performance on the SuperNI benchmark with stronger zero-shot transfer to unseen tasks.
- Scales to multimodal visual reasoning on VQA v2, consistently reducing forgetting compared to strong MoE baselines.
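The merging step in the first point, where transient-expert updates are folded into stable experts while important historical parameters are protected, could look roughly like the sketch below. This is an assumption-laden illustration: the summary does not say how importance is measured or how the regularization is applied, so the per-parameter `importance` scores, the `1 / (1 + strength * importance)` damping rule, and the function name `merge_transient` are all hypothetical.

```python
def merge_transient(stable, transient_delta, importance, strength=1.0):
    """Fold a transient expert's update into stable parameters.

    stable:          current stable-expert parameters (flat list of floats).
    transient_delta: update learned by the transient expert, same length.
    importance:      non-negative score per parameter; higher means the
                     parameter mattered more for earlier tasks (assumed given).

    Assumption: each update is damped by 1 / (1 + strength * importance), so
    important parameters move less. The actual CP-MoE rule may differ.
    """
    if not (len(stable) == len(transient_delta) == len(importance)):
        raise ValueError("all inputs must have the same length")
    return [
        w + d / (1.0 + strength * imp)
        for w, d, imp in zip(stable, transient_delta, importance)
    ]


# Toy example: both parameters receive the same update, but the second one
# is marked as important for earlier tasks and therefore changes far less.
merged = merge_transient(
    stable=[1.0, 1.0],
    transient_delta=[0.5, 0.5],
    importance=[0.0, 4.0],
)
```

Here the unprotected parameter absorbs the full update of 0.5, while the protected one moves by only 0.1.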
Why It Matters
CP-MoE lets LLMs and VLMs learn new tasks with less forgetting of earlier ones, a capability that matters for adaptive AI assistants and robotics.