Stochastic Gradient Descent in the Saddle-to-Saddle Regime of Deep Linear Networks
New paper shows stochastic gradient descent noise encodes information about which features a neural network is learning.
A team of researchers including Guillaume Corlouer and Avi Semler has published a theoretical paper analyzing how stochastic gradient descent (SGD) behaves during neural network training. Using deep linear networks (DLNs) as an analytically tractable model, they investigated the "saddle-to-saddle" regime, in which optimization passes from one saddle point of the loss to the next. The researchers modeled SGD as Langevin dynamics with anisotropic, state-dependent noise, providing a mathematical framework for understanding how randomness affects training.
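Schematically, this kind of modeling treats the parameter vector as a continuous-time diffusion whose drift is the negative loss gradient and whose noise covariance depends on the current parameters. The notation below is generic (η for the learning rate, Σ(θ) for the gradient-noise covariance, W_t for a Wiener process) and is an illustration of the setup rather than the paper's exact equation:

```latex
d\theta_t \;=\; -\nabla L(\theta_t)\,dt \;+\; \sqrt{\eta}\;\Sigma(\theta_t)^{1/2}\,dW_t
```

The anisotropy enters through Σ(θ): the noise is not spherical, and its shape changes as the weights move.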
Under specific assumptions about weight alignment and balance, the team derived an exact decomposition of the full dynamics into a system of simpler one-dimensional stochastic differential equations, one per mode. This decomposition revealed that diffusion (noise) along a given mode is maximal just before the corresponding feature is completely learned by the network. In effect, the noise pattern acts as a signal for what the network is currently learning.
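To make the per-mode picture concrete, here is a minimal Euler–Maruyama simulation of independent one-dimensional SDEs, one per mode. The drift and diffusion functions are hypothetical placeholders standing in for the paper's derived equations, and the target singular values are invented for illustration:

```python
import numpy as np

def simulate_mode(target, drift, diffusion, dt=1e-3, steps=20_000, s0=1e-3, seed=0):
    """Euler-Maruyama integration of a single-mode SDE:
    ds = drift(s) dt + diffusion(s) dW."""
    rng = np.random.default_rng(seed)
    s = s0
    traj, noise_scale = [], []
    for _ in range(steps):
        g = diffusion(s)
        s += drift(s) * dt + g * np.sqrt(dt) * rng.standard_normal()
        s = min(max(s, 1e-6), 1.5 * target)   # keep the toy mode positive and bounded
        traj.append(s)
        noise_scale.append(g)
    return np.array(traj), np.array(noise_scale)

# Hypothetical target singular values for three modes (illustration only).
targets = [3.0, 2.0, 1.0]

for k, sigma in enumerate(targets):
    # Placeholder saturating drift and state-dependent diffusion; the paper's
    # exact one-dimensional equations would be substituted here.
    drift = lambda s, sigma=sigma: s * (sigma - s)
    diffusion = lambda s, sigma=sigma: 0.05 * s * (sigma - s)
    traj, noise = simulate_mode(sigma, drift, diffusion, seed=k)
    learned = int(np.argmax(traj > 0.95 * sigma))
    print(f"mode {k}: placeholder diffusion peaks at step {int(noise.argmax())}, "
          f"mode reaches 95% of its target at step {learned}")
```

Running this prints, for each mode, the step at which the (placeholder) diffusion coefficient peaks alongside the step at which that mode nears its target value, which is the kind of comparison the paper's noise-timing result concerns.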
The research also characterized the stationary distributions that SGD converges to, showing that without label noise they match those of gradient flow, and that with label noise they are approximately Boltzmann. Crucially, experiments confirmed that these theoretical findings hold qualitatively even when the strict alignment assumptions are relaxed. The paper concludes that while SGD noise carries information about the progression of feature learning, it does not fundamentally change the underlying saddle-to-saddle dynamics that characterize deep network optimization.
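The Boltzmann-like stationary law in the label-noise case can be written schematically as below; here T_eff is an effective temperature assumed to grow with the learning rate and the label-noise level, a common heuristic rather than a statement taken from the paper:

```latex
p_\infty(\theta) \;\propto\; \exp\!\left(-\frac{L(\theta)}{T_{\mathrm{eff}}}\right)
```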
- SGD noise patterns reveal which features a neural network is currently learning, with maximal diffusion preceding complete feature acquisition
- Researchers derived exact decompositions of complex SGD dynamics into simpler one-dimensional equations per mode under aligned weight assumptions
- Experimental validation shows findings hold qualitatively even without strict theoretical assumptions, confirming practical relevance
Why It Matters
Provides a theoretical foundation for interpreting training dynamics and could lead to better optimization diagnostics and more efficient training algorithms.