Rolling-Origin Validation Reverses Model Rankings in Multi-Step PM10 Forecasting: XGBoost, SARIMA, and Persistence
An operationally realistic validation protocol suggests XGBoost's performance claims in air quality forecasting may be overstated.
A new machine learning study from researchers Federico Garcia Crespi, Eduardo Yubero Funes, and Marina Alfosea Simon delivers a critical lesson for AI model evaluation. Analyzing 2,350 days of PM10 air pollution data, the authors compared the popular gradient-boosting library XGBoost against a statistical SARIMA model and a simple persistence baseline. The key finding is methodological: under a standard static chronological train/test split, XGBoost appeared to perform well at forecast horizons of 1 to 7 days. Under a rolling-origin protocol, which mimics the real-world practice of retraining and updating the model each month, the rankings reversed.
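To make the contrast concrete, here is a minimal sketch of a rolling-origin loop with monthly retraining, in the spirit of the study's protocol. Everything below is an assumption for illustration: the 1-D array of daily PM10 means, the make_lag_features helper, and the XGBoost hyperparameters are hypothetical, not the authors' pipeline.

```python
import numpy as np
from xgboost import XGBRegressor  # pip install xgboost

N_LAGS = 7  # hypothetical choice: use the previous week as features

def make_lag_features(series, n_lags=N_LAGS):
    """Each row holds the n_lags previous days; the target is the next day."""
    X = np.column_stack([series[i:len(series) - n_lags + i] for i in range(n_lags)])
    y = series[n_lags:]
    return X, y

def rolling_origin_mae(series, horizon=7, retrain_every=30, initial=730):
    """Retrain roughly monthly; at each origin, forecast 1..horizon days recursively."""
    abs_errors = {h: [] for h in range(1, horizon + 1)}
    for origin in range(initial, len(series) - horizon, retrain_every):
        X_train, y_train = make_lag_features(series[:origin])
        model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
        model.fit(X_train, y_train)
        window = list(series[origin - N_LAGS:origin])  # last observed week
        for h in range(1, horizon + 1):
            pred = float(model.predict(np.array(window[-N_LAGS:])[None, :])[0])
            abs_errors[h].append(abs(series[origin + h - 1] - pred))
            window.append(pred)  # feed the forecast back in (recursive strategy)
    return {h: float(np.mean(e)) for h, e in abs_errors.items()}
```

A static chronological split corresponds to running this loop with a single origin; moving the origin forward month by month is what exposes how the ranking shifts over time.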
Under this more rigorous, operationally realistic validation, XGBoost was not consistently better than the naive persistence model at short and intermediate forecast horizons, while SARIMA maintained positive skill relative to persistence across the full forecast range. The study also introduces the 'predictability horizon', the maximum lead time at which a model still outperforms persistence, as a practical metric for practitioners. The core takeaway is that many published claims of machine learning superiority in time-series forecasting may be artifacts of flawed evaluation and may overstate operational usefulness.
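A minimal sketch of the predictability-horizon idea under one plausible definition: skill is taken as 1 - MAE_model / MAE_persistence per horizon, and the horizon is the longest initial run of lead times with positive skill. The metric choice (MAE rather than, say, RMSE) is an assumption, not the paper's exact formula.

```python
import numpy as np

def persistence_mae(series, horizon=7, start=730):
    """Persistence baseline: every lead time repeats the last observed value."""
    abs_errors = {h: [] for h in range(1, horizon + 1)}
    for origin in range(start, len(series) - horizon):
        last_obs = series[origin - 1]
        for h in range(1, horizon + 1):
            abs_errors[h].append(abs(series[origin + h - 1] - last_obs))
    return {h: float(np.mean(e)) for h, e in abs_errors.items()}

def predictability_horizon(model_mae, baseline_mae):
    """Longest run of lead times (from h=1) with positive skill over persistence."""
    horizon = 0
    for h in sorted(model_mae):
        skill = 1.0 - model_mae[h] / baseline_mae[h]
        if skill <= 0:  # assumed definition: the run ends at the first non-positive skill
            break
        horizon = h
    return horizon
```

Under this definition, SARIMA's positive skill across the full 1-to-7-day range corresponds to a predictability horizon of seven days, while a model that never beats persistence has a horizon of zero.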
- Rolling-origin validation, simulating monthly updates, reversed model rankings compared to static splits.
- XGBoost failed to consistently beat a simple persistence baseline under realistic validation, challenging prior claims.
- The statistical SARIMA model showed reliable, positive skill across all forecast horizons in the operational test (a minimal SARIMA sketch follows this list).
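For completeness, a hedged sketch of how the SARIMA competitor could be fitted and queried inside the same rolling loop, using statsmodels' SARIMAX. The (1, 0, 1)x(1, 0, 1, 7) order is an illustrative guess for daily data with weekly seasonality; the paper's specification is not given here.

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX  # pip install statsmodels

def sarima_forecast(history, horizon=7):
    """Fit SARIMA on all data observed so far and forecast `horizon` days ahead."""
    model = SARIMAX(history, order=(1, 0, 1), seasonal_order=(1, 0, 1, 7))
    fitted = model.fit(disp=False)
    return fitted.forecast(steps=horizon)
```

Refitting at each origin keeps the comparison fair: SARIMA's per-horizon MAE from this loop can be set against persistence_mae above to reproduce the skill comparison.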
Why It Matters
For data scientists, the practical implication is to validate time-series models under operationally realistic protocols, such as rolling-origin evaluation with periodic retraining, before trusting claims that a model beats simple baselines.