Research & Papers

ScoreStop uses functional score tests for smarter early stopping in gradient boosting

Ditch arbitrary patience: a new statistical test tells you exactly when to stop training.

Deep Dive

Gradient boosted decision trees (GBDTs) power many machine learning pipelines, but they notoriously overfit if trained too long. The standard early-stopping fix (stop when validation loss has not improved for N boosting rounds) requires a hard-to-tune patience parameter and can be fooled by noisy loss curves. ScoreStop, presented at the ICML 2026 Workshop on Hypothesis Testing, reframes the problem as a hypothesis test: at each iteration, test the null hypothesis that the current model is already the population risk minimizer. The test uses a functional score statistic computed on validation data from gradients rather than loss values. Because the statistic is scale-invariant and follows a known asymptotic distribution under the null, the method eliminates the need for arbitrary patience thresholds.
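To make the idea concrete, here is a minimal sketch of a gradient-based stopping test. It is not the statistic from the paper: it assumes a simplified one-dimensional score test that checks whether the validation gradients still correlate with a candidate update direction (for example, the next tree's predictions). The function name `score_z_test`, the `direction` argument, and the rule "keep training while the test rejects" are all assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean, stdev


def score_z_test(gradients, direction, alpha=0.05):
    """Simplified score test on validation data (illustrative assumption,
    not the ScoreStop statistic).

    gradients: per-example loss gradients w.r.t. the current prediction.
    direction: hypothetical candidate update h(x_i), e.g. the next tree's output.
    Returns (z, p_value, keep_training).
    """
    # Under the null (current model is the risk minimizer), g_i * h(x_i)
    # has mean zero, so the standardized mean is approximately N(0, 1).
    scores = [g * h for g, h in zip(gradients, direction)]
    se = stdev(scores) / math.sqrt(len(scores))
    if se == 0.0:
        raise ValueError("score products have zero variance")
    z = mean(scores) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    # Rejecting the null means the update direction still reduces risk.
    return z, p_value, p_value < alpha


# Toy data: squared-error gradients g = prediction - target.
targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
underfit_preds = [3.5] * 6
grads = [f - y for f, y in zip(underfit_preds, targets)]
stump = [-2.0, -2.0, -2.0, 2.0, 2.0, 2.0]  # hypothetical next-tree output

z, p, keep_training = score_z_test(grads, stump)
print(f"underfit model: z={z:.2f}, p={p:.4f}, keep_training={keep_training}")

# Multiplying the loss (and hence the gradients) by a constant leaves z unchanged.
z_scaled, _, _ = score_z_test([10.0 * g for g in grads], stump)
print(f"scaled gradients: z={z_scaled:.2f}")
```

The toy sample is far too small for the normal approximation to be trustworthy; it only shows the mechanics. The last two lines illustrate the scale-invariance point: rescaling the loss rescales the numerator and the standard error by the same factor.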

Crucially, ScoreStop works with any loss that has a gradient, including implicit losses such as LambdaRank (used in learning-to-rank) and data-dependent losses like Cox regression (handled via influence functions). In synthetic experiments and real-data benchmarks, ScoreStop matches the performance of loss-based early stopping while being more principled and tuning-free. This makes it especially valuable for practitioners using XGBoost, LightGBM, or CatBoost who want robust stopping without manual calibration. The method could also extend to neural networks and other gradient-based models, though the paper focuses on tree ensembles.
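The sketch below illustrates why a gradient-based test carries over to different objectives: all it needs is a per-example gradient with respect to the model's raw score. The three functions are standard textbook gradients, not code from the paper. The ranking example uses plain pairwise (RankNet-style) lambdas without the NDCG weighting that LambdaRank adds, which is a simplifying assumption, and the Cox case is omitted because the paper's influence-function treatment is not described here.

```python
import math


def grad_squared_error(y, f):
    """Gradient of 0.5 * (f - y)**2 with respect to the prediction f."""
    return f - y


def grad_logistic(y, f):
    """Gradient of binary log loss w.r.t. the raw score f, with y in {0, 1}."""
    p = 1.0 / (1.0 + math.exp(-f))
    return p - y


def pairwise_lambdas(scores, relevance):
    """RankNet-style lambdas for one query (simplified: no NDCG weighting).

    For each pair where item i is more relevant than item j, the gradient
    pushes score i up and score j down by the same amount.
    """
    lambdas = [0.0] * len(scores)
    for i in range(len(scores)):
        for j in range(len(scores)):
            if relevance[i] > relevance[j]:
                rho = 1.0 / (1.0 + math.exp(scores[i] - scores[j]))
                lambdas[i] -= rho
                lambdas[j] += rho
    return lambdas


print(grad_squared_error(2.0, 3.0))
print(grad_logistic(1, 0.0))
print(pairwise_lambdas([0.0, 0.0], [1, 0]))
```

In the ranking case the gradients are defined directly, without an explicit loss value to monitor, which is the situation where a loss-based stopping rule has nothing natural to track.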

Key Points
  • ScoreStop replaces the arbitrary patience parameter with a functional score test that has a known asymptotic distribution under the null hypothesis.
  • The method uses gradients instead of loss values, making it scale-invariant and applicable to implicit losses like LambdaRank and data-dependent losses like Cox regression.
  • In benchmarks, ScoreStop achieves competitive performance with loss-based early stopping while eliminating the need for manual tuning of patience.

Why It Matters

ScoreStop removes the guesswork from early stopping in gradient boosting: it saves the time spent tuning patience while matching the generalization of loss-based stopping.