Continual Distillation of Teachers from Different Domains
New CVPR 2026 paper tackles knowledge loss when training on multiple domain experts.
Deep learning models are growing so large that storing and retraining them is impractical. Researchers Nicolas Michel, Maorong Wang, Jiangpeng He, and Toshihiko Yamasaki propose a new paradigm called Continual Distillation (CD), in which a compact student model learns sequentially from a stream of teacher models, without ever revisiting the teachers' original training data. This addresses two major challenges: teacher data is often unavailable, and each teacher specializes in a different domain. The team shows that external unlabeled data can enable Unseen Knowledge Transfer (UKT), allowing the student to acquire knowledge from domains the teacher has mastered but that the student never sees during its own training.
However, sequential distillation introduces a new problem: Unseen Knowledge Forgetting (UKF). As the student trains on later teachers, it loses knowledge transferred from earlier ones. To balance UKT against UKF, the authors propose Self External Data Distillation (SE2D), a method that preserves the student's logits on the external data to stabilize learning across heterogeneous teachers. Experiments on multiple benchmarks show that SE2D significantly reduces UKF and improves cross-domain generalization. The work has been accepted at CVPR 2026, and the code is available on GitHub.
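To make the setup concrete, here is a minimal PyTorch-style sketch of the sequential distillation loop with a logit-preservation term in the spirit of SE2D. The function names (soft_kl, continual_distillation), the snapshot-based regularizer, and hyperparameters such as lambda_se2d and temperature are illustrative assumptions, not the paper's exact recipe; the authors' actual losses and training schedule are in their released code.

```python
import copy
import torch
import torch.nn.functional as F

def soft_kl(student_logits, target_logits, temperature):
    """Temperature-scaled KL divergence between softened distributions."""
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(target_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2

def continual_distillation(student, teachers, external_loader,
                           epochs=1, lr=1e-3, temperature=2.0, lambda_se2d=1.0):
    """Distill a stream of domain-expert teachers into one student using only
    unlabeled external data. Before each new teacher, a frozen snapshot of the
    student is kept; its logits on the external data act as a regularizer so
    previously transferred knowledge is not overwritten."""
    optimizer = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    for teacher in teachers:
        teacher.eval()
        prev_student = copy.deepcopy(student).eval()  # snapshot before the new teacher
        student.train()
        for _ in range(epochs):
            for batch in external_loader:  # unlabeled external images
                logits_s = student(batch)
                with torch.no_grad():
                    logits_t = teacher(batch)          # current domain expert
                    logits_prev = prev_student(batch)  # student's own past logits
                # Transfer from the current teacher (UKT) ...
                kd_loss = soft_kl(logits_s, logits_t, temperature)
                # ... while preserving the student's previous logits (limits UKF).
                preserve_loss = soft_kl(logits_s, logits_prev, temperature)
                loss = kd_loss + lambda_se2d * preserve_loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return student
```

Note that both the transfer term and the preservation term are computed on the same unlabeled external batch, so at no point does the student need any teacher's original training data.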
- Introduces Continual Distillation (CD) where a student learns from a stream of teachers without access to original training data.
- Proposes SE2D (Self External Data Distillation) to balance Unseen Knowledge Transfer (UKT) and Unseen Knowledge Forgetting (UKF) using external unlabeled data.
- Accepted at CVPR 2026 with publicly available code; experiments show reduced forgetting and better cross-domain generalization.
Why It Matters
Enables efficient, incremental model updates from multiple domain experts without catastrophic forgetting and without access to the teachers' training data.