Research & Papers

Everett & Paquette reveal momentum's phase transitions under sparse updates

New theory shows momentum can destabilize training when learning outpaces momentum retention.

Deep Dive

Everett and Paquette's new paper, submitted to arXiv, addresses a critical gap in momentum theory: existing models assume that gradients arrive at every parameter at a constant rate, an assumption violated by heavy-tailed data and modern architectures. The authors analyze two tractable models (least squares with sparse inputs and logistic regression with a rare class) and derive exact closed-form second-moment dynamics. They characterize the high-dimensional limits in terms of three scaling exponents for sparsity, batch size, and momentum decay.

The key finding is a phase diagram governed by the ratio of two intrinsic timescales: momentum retention (the number of active updates over which the momentum buffer persists) and learning (the number of active updates needed to reduce the squared error). When learning is much slower than retention, the limit matches standard SGD. When learning is faster than retention, the system becomes unstable. Only when the two timescales coincide does the limit recover classical heavy-ball dynamics. Importantly, oscillatory dynamics set in at different momentum values for different levels of token sparsity, so any single global momentum hyperparameter will conflict across token frequencies. This is a fundamental limitation for training on sparse, high-dimensional data.
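To make the two timescales concrete, here is a toy sketch in Python. It is not the paper's model: it tracks a single scalar coordinate on a quadratic loss, where the coordinate receives a gradient with probability p on each step and a standard heavy-ball buffer keeps updating it in between. The function names and the two timescale formulas (p / (1 - beta) for retention, 1 / (lr * curvature) for learning) are rough heuristics assumed here for illustration and are not taken from the paper.

```python
import random


def retention_active_updates(beta, p):
    # Assumed heuristic: the buffer shrinks by a factor beta per step, so it
    # persists for about 1 / (1 - beta) steps, of which a fraction p are
    # active for this coordinate.
    return p / (1.0 - beta)


def learning_active_updates(lr, curvature):
    # Assumed heuristic: without momentum, each active update on a scalar
    # quadratic contracts the error by (1 - lr * curvature).
    return 1.0 / (lr * curvature)


def simulate(beta, lr, p, steps, curvature=1.0, seed=0):
    """Heavy-ball on the loss 0.5 * curvature * w**2 with sparse gradients."""
    rng = random.Random(seed)
    w, m = 1.0, 0.0
    path = []
    for _ in range(steps):
        # The coordinate sees a gradient only on "active" steps.
        grad = curvature * w if rng.random() < p else 0.0
        m = beta * m + grad  # buffer decays on every step, active or not
        w -= lr * m          # buffer keeps moving w between active steps
        path.append(w)
    return path


def sign_changes(path):
    # Number of zero crossings; a crude indicator of oscillation.
    return sum(1 for a, b in zip(path, path[1:]) if a * b < 0)


if __name__ == "__main__":
    beta, lr = 0.9, 0.5
    for p in (1.0, 0.3, 0.03):
        path = simulate(beta, lr, p, steps=2000)
        print(
            f"p={p}: retention~{retention_active_updates(beta, p):.2f}, "
            f"learning~{learning_active_updates(lr, 1.0):.2f}, "
            f"sign changes={sign_changes(path)}"
        )
```

Running the script with one fixed beta and several values of p gives a hands-on feel for how the same momentum setting behaves differently for frequent and rare coordinates. The toy makes no attempt to reproduce the paper's phase boundaries.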

Key Points
  • Closed-form dynamics derived for least squares (sparse inputs) and logistic regression (rare class) under momentum with sparse updates.
  • Phase structure depends on the ratio of the momentum retention timescale to the learning timescale; heavy-ball behavior appears only when the two are equal.
  • Oscillatory dynamics vary with token sparsity, creating a spectral conflict that prevents a single momentum hyperparameter from working well across all token frequencies.

Why It Matters

The analysis explains why momentum can fail in sparse, high-dimensional settings and can guide better hyperparameter selection for modern architectures.