DIET: Learning to Distill Dataset Continually for Recommender Systems
A new AI method distills massive streaming datasets, cutting model development costs by up to 60x while keeping results consistent with full-data training.
A research team led by Jiaqing Zhang has introduced DIET, a novel framework designed to solve a critical bottleneck in modern recommender system development. Platforms like streaming services and e-commerce sites rely on massive, continuously growing datasets of user interactions. Currently, testing a new model architecture requires prohibitively expensive retraining on the entire historical dataset, which can contain billions of records. DIET addresses this by formulating the problem as 'streaming dataset distillation,' creating and maintaining a compact, evolving dataset that preserves the essential training signals of the full data stream.
Unlike static distillation methods, DIET treats the distilled data as a dynamic training memory that is updated in stages to stay aligned with long-term data trends. It uses a bi-level optimization framework with principled initialization from influential samples and selective, influence-aware memory updates. In experiments on large-scale benchmarks, DIET compressed training data to just 1-2% of its original size. Crucially, the performance trends observed when training on this small distilled set remained consistent with full-data training, enabling valid architecture comparisons.
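The paper's exact objective and update rules are not reproduced here, but the following minimal PyTorch sketch illustrates the general shape of such a staged, bi-level pipeline: a small learnable memory is initialized from influential samples, refined by gradient matching against each incoming data chunk, and selectively refreshed. Every name here (TinyRec, influence_proxy, outer_step) and the gradient-matching objective itself are illustrative assumptions, not DIET's actual implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TinyRec(nn.Module):
    """Toy scorer standing in for a real recommender model (hypothetical)."""
    def __init__(self, dim=16, n_items=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_items))

    def forward(self, x):
        return self.net(x)

def influence_proxy(model, x, y):
    # Cheap influence stand-in: norm of the loss gradient at the logits
    # (softmax minus one-hot). DIET's actual influence measure may differ.
    with torch.no_grad():
        g = model(x).softmax(-1)
        g[torch.arange(len(y)), y] -= 1.0
        return g.norm(dim=-1)

def outer_step(model, x_syn, y_syn, x_real, y_real, syn_opt):
    # Outer level: nudge the synthetic memory so that model gradients on it
    # match gradients on the incoming real chunk (gradient matching).
    params = [p for p in model.parameters()]
    g_real = torch.autograd.grad(F.cross_entropy(model(x_real), y_real), params)
    g_syn = torch.autograd.grad(F.cross_entropy(model(x_syn), y_syn),
                                params, create_graph=True)
    match = sum(((a - b.detach()) ** 2).sum() for a, b in zip(g_syn, g_real))
    syn_opt.zero_grad()
    match.backward()
    syn_opt.step()

def inner_steps(model, x_syn, y_syn, steps=5, lr=1e-2):
    # Inner level: briefly fit the model on the distilled memory.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x_syn), y_syn).backward()
        opt.step()

torch.manual_seed(0)
model, mem_size = TinyRec(), 16          # memory ~1.5% of each 1024-sample chunk
x0, y0 = torch.randn(1024, 16), torch.randint(0, 100, (1024,))
top = influence_proxy(model, x0, y0).topk(mem_size).indices  # init from influential samples
x_syn = x0[top].clone().requires_grad_(True)
y_syn = y0[top].clone()
syn_opt = torch.optim.Adam([x_syn], lr=1e-2)

for stage in range(3):                   # each stage consumes one new data chunk
    x_real, y_real = torch.randn(1024, 16), torch.randint(0, 100, (1024,))
    inner_steps(model, x_syn.detach(), y_syn)
    for _ in range(10):
        outer_step(model, x_syn, y_syn, x_real, y_real, syn_opt)
    # Influence-aware refresh: swap the memory's least influential slot
    # for the chunk's most influential sample (selective update).
    worst = influence_proxy(model, x_syn.detach(), y_syn).argmin()
    best = influence_proxy(model, x_real, y_real).argmax()
    with torch.no_grad():
        x_syn[worst] = x_real[best]
    y_syn[worst] = y_real[best]
```

The staged structure is what distinguishes this from one-shot distillation: the memory is never rebuilt from scratch, only refined and selectively refreshed as new chunks arrive, which is what keeps it aligned with long-term trends at low cost.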
The impact is a dramatic reduction in computational cost and development time: the framework cuts the cost of model iteration by up to 60 times. The distilled datasets it produces also generalize across different model architectures, establishing them as a reusable, scalable data foundation. This could significantly accelerate the pace of innovation for the AI-powered recommendation engines that power much of the modern web.
- Compresses massive, streaming training data for recommender systems to just 1-2% of original size.
- Enables up to 60x faster model iteration by eliminating the need for full retraining on historical data.
- Produces reusable distilled datasets that generalize across different model architectures for scalable R&D.
Why It Matters
Dramatically lowers the cost and time for developing the AI that recommends products, videos, and content on major platforms.