Research & Papers

The Unreasonable Effectiveness of Data for Recommender Systems

Research on 11 datasets and 10 algorithms finds performance keeps improving up to 100M interactions.

Deep Dive

A new research paper, 'The Unreasonable Effectiveness of Data for Recommender Systems' by Youssef Abdou, challenges the assumption that recommendation models hit diminishing returns as training data grows. The study implemented a rigorous evaluation workflow using the LensKit and RecBole toolkits, testing 10 algorithm combinations across 11 large public datasets, each containing at least 7 million user-item interactions. Models were trained on sample sizes ranging from 100,000 to 100 million interactions, with performance measured by the standard NDCG@10 ranking metric.
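For readers unfamiliar with the metric, NDCG@10 rewards placing relevant items near the top of a 10-item recommendation list. A minimal sketch of how it is typically computed (this is an illustrative implementation with binary relevance, not code from the paper or the toolkits it uses):

```python
import math

def ndcg_at_k(ranked_items, relevant_items, k=10):
    """NDCG@k for one user's ranked list, with binary relevance.

    ranked_items:   items in the order the model recommends them
    relevant_items: set of items the user actually interacted with
    """
    # DCG: each hit contributes a gain discounted by log2 of its rank
    # (enumerate is 0-based, so rank + 2 gives log2(2) at position 1)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    # Ideal DCG: all relevant items placed at the very top of the list
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, `ndcg_at_k(["a", "b", "c"], {"a", "c"})` scores about 0.92: both relevant items are retrieved, but "c" sits at rank 3 instead of rank 2. Toolkit implementations such as RecBole's may use graded relevance or slightly different gain formulas, but the ranking-quality intuition is the same.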

Contrary to expectations, the analysis revealed no observable saturation point where additional data stopped providing meaningful gains. When results were normalized for comparison, approximately 75% of the tested configurations achieved their best performance at the largest completed sample size of 100 million interactions. A late-stage slope analysis further confirmed this trend, showing consistently positive improvement rates in the final training stages across most dataset-algorithm pairs.

The findings have significant implications for how companies approach recommender system development. While much attention has focused on developing more sophisticated algorithms like neural networks or transformer-based models, this research suggests that for traditional collaborative filtering approaches, simply collecting more interaction data remains one of the most reliable ways to improve recommendation quality. The study did note weaker scaling behavior in atypical dataset cases and with the RecBole BPR algorithm, indicating that data effectiveness isn't universal but applies broadly to mainstream approaches.

Key Points
  • Tested 10 algorithm combinations across 11 datasets with 7M+ interactions each
  • Found no performance saturation point even at 100 million training samples
  • 75% of configurations achieved best normalized NDCG@10 at maximum data scale

Why It Matters

Prioritizing data collection over algorithmic complexity could be more cost-effective for improving traditional recommender systems.