SplitLight: An Exploratory Toolkit for Recommender Systems Datasets and Splits
Open-source Python tool diagnoses hidden data-preparation issues that can substantially reorder model rankings.
A research team has launched SplitLight, an open-source exploratory toolkit that tackles a critical but often overlooked problem in recommender systems: the hidden data-preparation choices that undermine reproducibility. The tool, detailed in a new arXiv paper, lets researchers and engineers audit their datasets and data splits (the division of data into training and testing sets), making these decisions measurable and reportable.
SplitLight performs a comprehensive diagnostic on interaction logs, analyzing core statistics, temporal patterns, timestamp anomalies, and the validity of data splits. It specifically flags issues like temporal leakage (where future data contaminates the training set), cold-user/item exposure (users or items that appear in the test set but never in training), and distribution shifts between splits. These seemingly minor preprocessing decisions, such as how to handle repeat consumption or filter users, have been shown to substantially reorder model performance rankings, making cross-paper comparisons unreliable. The toolkit enables side-by-side comparison of different splitting strategies through aggregated summaries and interactive visualizations.
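SplitLight's programmatic interface is not spelled out in the announcement, but the checks it describes are easy to illustrate. The sketch below, written against plain pandas with hypothetical column names (user_id, item_id, timestamp), shows the kind of temporal-leakage and cold-start diagnostics the toolkit automates; it is an approximation of the idea, not SplitLight's actual API.

```python
import pandas as pd

def audit_split(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Illustrative split diagnostics in the spirit of SplitLight.

    Assumes an interaction log with 'user_id', 'item_id', and a numeric
    'timestamp' column; these names are hypothetical, not SplitLight's API.
    """
    # Temporal leakage: any training interaction occurring at or after the
    # earliest test interaction "sees the future" relative to the test set.
    test_start = test["timestamp"].min()
    leaked = (train["timestamp"] >= test_start).sum()

    # Cold users/items: present in the test set but never seen in training,
    # so the model has no history to learn from.
    cold_users = set(test["user_id"]) - set(train["user_id"])
    cold_items = set(test["item_id"]) - set(train["item_id"])

    return {
        "train_interactions": len(train),
        "test_interactions": len(test),
        "leaked_train_rows": int(leaked),
        "cold_user_share": len(cold_users) / test["user_id"].nunique(),
        "cold_item_share": len(cold_items) / test["item_id"].nunique(),
    }
```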
The practical impact is significant for both industry and academia. Delivered as a Python package and an interactive no-code web interface, SplitLight generates audit summaries that justify evaluation protocols. This moves the field toward transparent and reliable experimentation, ensuring that a model claiming superior performance on a dataset like MovieLens or Amazon Reviews is actually better, not just benefiting from an advantageous but undocumented data split. It addresses a foundational issue in machine learning evaluation, promoting rigor in a field where benchmarking is paramount.
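To see how such an audit can justify an evaluation protocol, consider comparing a random split against a temporal split of the same interaction log. The sketch below reuses the hypothetical audit_split helper above; the file path and splitting functions are illustrative and not part of SplitLight.

```python
import json
import pandas as pd

# Assumes the audit_split helper from the sketch above is in scope.

def temporal_split(log: pd.DataFrame, test_frac: float = 0.2):
    """Hold out the most recent interactions; no future data leaks into training."""
    log = log.sort_values("timestamp")
    cut = int(len(log) * (1 - test_frac))
    return log.iloc[:cut], log.iloc[cut:]

def random_split(log: pd.DataFrame, test_frac: float = 0.2, seed: int = 0):
    """Shuffle interactions uniformly; by construction this ignores time."""
    test = log.sample(frac=test_frac, random_state=seed)
    return log.drop(test.index), test

log = pd.read_csv("interactions.csv")  # hypothetical input file
summary = {
    name: audit_split(*split_fn(log))
    for name, split_fn in [("random", random_split), ("temporal", temporal_split)]
}
print(json.dumps(summary, indent=2))
```

A random split will typically report leaked training rows, while a temporal split reports none but usually exposes more cold users and items; putting those numbers side by side is precisely the trade-off an audit summary makes explicit.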
- Diagnoses hidden data split issues like temporal leakage and cold-start exposure that can reorder model rankings.
- Provides both a Python library for integration and a no-code interactive interface for visualization and comparison.
- Generates audit summaries to justify evaluation protocols, aiming to fix reproducibility crises in recommender systems research.
Why It Matters
Ensures recommender system benchmarks are reliable and reproducible, a foundational need for both industrial deployment and academic research.