Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization
A 39-page paper proves that Adam's instability is mathematically predictable, and pins down when SGD beats it.
A new theoretical paper from Cornell researchers Sahu, Sarkar, Hogan, and Wells provides the first rigorous proof of when Adam outperforms vanilla SGD, and when it does not, in nonstationary optimization. The work, published on arXiv (2605.04269), separates two regimes: Euclidean tracking under adaptive strong monotonicity, and high-probability projected stationarity under general L-smooth objectives. The authors decompose Adam's error into four sharp components: initialization, objective drift, first-moment tracking governed by β1, and preconditioner perturbation governed by β2. They characterize the burn-in time needed to reach Adam's irreducible tracking floor under constant and step-decay schedules, and prove a high-probability bound on the average projected stationarity gap under distribution shift.
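For reference, here is a minimal sketch of the standard Adam update that the analysis targets, with comments mapping each quantity to the paper's four error components. The function name, defaults, and comments are ours, not the authors'.

```python
import numpy as np

def adam_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam step; t is the 1-indexed iteration count."""
    # First-moment EMA: beta1 sets how slowly stale gradient information
    # is forgotten (the first-moment tracking term in the decomposition).
    m = beta1 * m + (1 - beta1) * grad
    # Second-moment EMA: beta2 sets the preconditioner's memory
    # (the preconditioner-perturbation term).
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias corrections for the zero initialization of m and v.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # eps floors the preconditioner; the paper's bounds depend on it explicitly.
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```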
Across both analyses, the bounds reveal a fundamental noise–drift tradeoff. In noise-dominated regimes, Adam's first-moment averaging and adaptive preconditioning improve high-probability error. In drift-dominated regimes, where the objective changes meaningfully over time, stale first-moment information and preconditioner perturbations compound the cost of nonstationarity, allowing vanilla SGD to achieve a smaller tracking floor. The explicit (β1, β2, ε)-dependent bounds provide a theoretical mechanism for Adam's well-known empirical instability and offer practical guidance: when data distributions shift rapidly, practitioners may want to decrease β1 (so stale gradients are forgotten faster) or switch to SGD entirely. The paper spans 39 pages of rigorous analysis, with 11 figures and 1 table.
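As a toy illustration of the tradeoff, consider tracking the minimizer c_t of a drifting quadratic f_t(x) = (x − c_t)²/2 with noisy gradients. This is our construction, not an experiment from the paper, and the exact numbers depend on the step size and drift/noise levels chosen.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_tracking_error(optimizer, drift, noise, steps=5000, lr=0.05):
    """Average squared distance to the moving minimizer c_t under
    either 'sgd' or 'adam' (beta1=0.9, beta2=0.999, eps=1e-8)."""
    x = c = m = v = 0.0
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    total = 0.0
    for t in range(1, steps + 1):
        c += drift                          # objective drift per step
        g = (x - c) + noise * rng.normal()  # noisy gradient of f_t at x
        if optimizer == "sgd":
            x -= lr * g
        else:
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            x -= lr * (m / (1 - beta1**t)) / (np.sqrt(v / (1 - beta2**t)) + eps)
        total += (x - c) ** 2
    return total / steps

# Noise-dominated: no drift, heavy gradient noise. Adam's first-moment
# averaging tends to give the smaller tracking error here.
print("noise:", avg_tracking_error("sgd", 0.0, 1.0), avg_tracking_error("adam", 0.0, 1.0))
# Drift-dominated: fast drift, no noise. Adam's normalized step (magnitude
# roughly lr per iteration) cannot keep up, while SGD settles at a bounded lag.
print("drift:", avg_tracking_error("sgd", 0.1, 0.0), avg_tracking_error("adam", 0.1, 0.0))
```

In the drift-dominated run, SGD settles at a lag of roughly drift/lr, while Adam's near-unit-normalized step of about lr per iteration falls behind a drift of 0.1 > 0.05, so its error grows with the horizon; a cartoon of the stale-moment effect the bounds quantify.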
- Adam's error decomposes into four components: initialization, drift, first-moment tracking (β1), and preconditioner perturbation (β2); see the schematic after this list.
- Noise-dominated regimes favor Adam; drift-dominated regimes allow vanilla SGD to achieve a smaller tracking floor.
- Explicit (β1, β2, ε)-dependent bounds explain Adam's instability under distribution shift and guide hyperparameter tuning.
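Schematically, and suppressing the constants and rates the paper makes explicit, the decomposition in the first bullet has the following shape. The contraction factor ρ and the functional forms f, g, h are placeholders of ours, not the authors' expressions.

```latex
\mathbb{E}\,\|x_t - x_t^\star\|^2 \;\lesssim\;
\underbrace{\rho^{\,t}\,\|x_0 - x_0^\star\|^2}_{\text{initialization}}
\;+\; \underbrace{f(\Delta)}_{\text{objective drift}}
\;+\; \underbrace{g(\beta_1)}_{\text{first-moment tracking}}
\;+\; \underbrace{h(\beta_2,\varepsilon)}_{\text{preconditioner perturbation}}
```

Here Δ denotes the per-step drift of the objective and x_t^⋆ the time-t minimizer; the non-decaying terms correspond to the irreducible tracking floor, which iterates reach after the burn-in time the paper characterizes.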
Why It Matters
Gives ML practitioners a principled rule for choosing optimizers when training data distributions shift over time.