Convergence of gradient descent for deep neural networks
A simple 'positive' initialization scheme guarantees linear convergence to zero training loss, accelerating optimization.
A new theoretical result from Sourav Chatterjee gives a constructive proof that gradient descent converges to zero training loss for standard feedforward neural networks. Published as arXiv:2203.16462v5, the paper introduces a simple local Polyak-Łojasiewicz (PL) criterion that guarantees linear (exponential) convergence of gradient flow and gradient descent to global minimizers.
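For intuition, here is the standard one-line calculation showing why a PL-type inequality forces linear (exponential) decay of the loss under gradient flow. This is only a sketch: the constant α and the region in which the inequality must hold are placeholders, and the paper's local criterion is more refined than this schematic version.

```latex
% Sketch: if a PL inequality \|\nabla L(\theta)\|^2 \ge \alpha\, L(\theta) (with \alpha > 0)
% holds along the gradient-flow trajectory, the chain rule turns it into a differential
% inequality, and Gronwall's lemma gives exponential decay of the loss.
\[
  \dot\theta(t) = -\nabla L(\theta(t))
  \;\Longrightarrow\;
  \frac{d}{dt}\, L(\theta(t)) = -\bigl\|\nabla L(\theta(t))\bigr\|^{2}
  \le -\alpha\, L(\theta(t))
  \;\Longrightarrow\;
  L(\theta(t)) \le L(\theta(0))\, e^{-\alpha t}.
\]
```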
The work operates in a regime complementary to typical over-parameterized analyses: instead of requiring extremely wide networks, it fixes the network width and depth and instead requires the input data vectors to be linearly independent (so the ambient input dimension is at least the number of data points). The verification of the criterion is constructive and leads to a specific 'positive' initialization scheme: zero first-layer weights, strictly positive hidden-layer weights, and sufficiently large output-layer weights. Under this initialization, gradient descent provably converges to an interpolating global minimizer.
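A minimal sketch of what such an initialization could look like for a fully connected network, written in JAX. The layer convention (weight matrices of shape out × in, no biases), the uniform distribution for the positive entries, and the `out_scale` constant are illustrative assumptions, not the paper's prescribed construction.

```python
import jax
import jax.numpy as jnp

def positive_init(key, widths, out_scale=10.0):
    """'Positive' initialization sketch: zero first layer, strictly positive
    hidden layers, large positive output layer. `widths` lists layer sizes,
    e.g. [d, m1, m2, 1]; all constants here are illustrative assumptions."""
    keys = jax.random.split(key, len(widths) - 1)
    params = []
    for i, (fan_in, fan_out) in enumerate(zip(widths[:-1], widths[1:])):
        if i == 0:
            # First-layer weights start at exactly zero.
            W = jnp.zeros((fan_out, fan_in))
        else:
            # Strictly positive entries, bounded away from zero.
            W = jax.random.uniform(keys[i], (fan_out, fan_in), minval=1e-3, maxval=1.0)
            if i == len(widths) - 2:
                # "Sufficiently large" output-layer weights.
                W = out_scale * W
        params.append(W)
    return params

# Example: inputs in R^16, two hidden layers of width 32, scalar output.
params = positive_init(jax.random.PRNGKey(0), [16, 32, 32, 1])
```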
Numerical experiments show that this theory-guided initialization can substantially accelerate optimization relative to standard random initializations at the same network width. The paper also discusses probabilistic corollaries for random initializations, clarifying how the guarantee depends on the probability that the initialization satisfies the required condition. This represents significant progress in understanding why gradient descent works for deep learning, moving beyond empirical observation to provable guarantees for practically sized network architectures.
- Proves gradient descent converges to zero loss for fixed-width networks with smooth activations
- Introduces constructive 'positive' initialization: zero first layer, positive hidden layers, large output layer
- Numerical experiments show 2-3x acceleration over random initialization for the same network architecture (see the comparison sketch after this list)
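A minimal sketch, in JAX, of the kind of comparison those experiments describe: full-batch gradient descent on a small tanh network with linearly independent inputs (n ≤ d), started either from the 'positive' initialization or from a standard Gaussian initialization. The architecture, data, step size, output scale, and step counts are illustrative assumptions; no claim is made that this harness reproduces the paper's 2-3x figure.

```python
import jax
import jax.numpy as jnp

def init(key, widths, positive=True, out_scale=3.0):
    """Either the 'positive' scheme sketched above or a standard Gaussian init."""
    keys = jax.random.split(key, len(widths) - 1)
    params = []
    for i, (fi, fo) in enumerate(zip(widths[:-1], widths[1:])):
        if positive:
            if i == 0:
                W = jnp.zeros((fo, fi))                                  # zero first layer
            else:
                W = jax.random.uniform(keys[i], (fo, fi), minval=1e-3, maxval=1.0)
                if i == len(widths) - 2:
                    W = out_scale * W                                    # large output layer
        else:
            W = jax.random.normal(keys[i], (fo, fi)) / jnp.sqrt(fi)      # standard Gaussian init
        params.append(W)
    return params

def forward(params, X):
    h = X
    for W in params[:-1]:
        h = jnp.tanh(h @ W.T)          # smooth activation, as the theory assumes
    return (h @ params[-1].T).ravel()

def loss(params, X, y):
    return jnp.mean((forward(params, X) - y) ** 2)

def train(params, X, y, lr=1e-3, steps=2000, log_every=500):
    # Plain full-batch gradient descent; the step size is an assumption and may
    # need tuning relative to the output scale and widths.
    grad_fn = jax.jit(jax.grad(loss))
    for t in range(steps):
        if t % log_every == 0:
            print(f"  step {t:5d}  loss {float(loss(params, X, y)):.3e}")
        g = grad_fn(params, X, y)
        params = [W - lr * dW for W, dW in zip(params, g)]
    return params

# n = 20 data points in dimension d = 32, so the inputs are (almost surely)
# linearly independent, matching the n <= d assumption described above.
X = jax.random.normal(jax.random.PRNGKey(1), (20, 32))
y = jax.random.normal(jax.random.PRNGKey(2), (20,))
widths = [32, 8, 8, 1]

print("positive init:")
train(init(jax.random.PRNGKey(0), widths, positive=True), X, y)
print("standard init:")
train(init(jax.random.PRNGKey(0), widths, positive=False), X, y)
```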
Why It Matters
Provides a theoretical foundation for why gradient descent trains neural networks successfully and offers a practical initialization scheme that can accelerate convergence.