How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?
A new 62-page paper shows how gradient descent's 'implicit bias' selects approximately minimum-norm solutions with high probability.
A team of researchers including Kuo-Wei Lai, Guanghui Wang, Molei Tao, and Vidya Muthukumar has published a significant paper on arXiv (2603.04895) that mathematically characterizes which solutions gradient descent (GD) implicitly selects when training shallow ReLU neural networks. The work addresses a fundamental question in machine learning theory: when an overparameterized model has many global minima, which one does gradient descent actually find? Their analysis bridges two extremes: prior work showing that implicit bias can fail to exist in worst-case scenarios (Vardi and Shamir, 2021), and work showing that GD finds the exact minimum-l2-norm solution under perfectly orthogonal data (Boursier et al., 2022). The new analysis provides a probabilistic guarantee for realistic, high-dimensional random data.
The researchers developed a novel primal-dual analysis framework that tracks the joint evolution of the network's predictions and of the coefficients of the weights in the span of the data during training. Their key finding is that for sufficiently high-dimensional random data, gradient descent approximates the minimum-l2-norm solution with high probability, with an approximation gap scaling as Θ(√(n/d)), where n is the number of training examples and d is the feature dimension. The 62-page analysis also shows that ReLU activation patterns stabilize quickly during optimization, providing mathematical justification for why gradient-trained neural networks tend to find simple, generalizable solutions rather than complex, overfitting ones. The results have implications for understanding generalization in deep learning and could inform better optimization algorithms and architecture design.
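The linear analogue of this implicit bias is classical and easy to check numerically: on an overparameterized least-squares problem, gradient descent started from zero converges to the minimum-l2-norm interpolator, i.e., the pseudoinverse solution. A minimal NumPy sketch of that classical fact (dimensions, step size, and iteration count are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                      # overparameterized: d >> n
X = rng.standard_normal((n, d)) / np.sqrt(d)   # rows have ~unit norm
y = rng.standard_normal(n)

# Full-batch gradient descent on the squared loss, from zero init.
w = np.zeros(d)
lr = 0.5
for _ in range(5000):
    w -= lr * X.T @ (X @ w - y)

# Minimum-l2-norm interpolator: the pseudoinverse solution.
w_min = np.linalg.pinv(X) @ y

print(np.linalg.norm(w - w_min))    # tiny gap: GD found the min-norm solution
```

Starting from zero keeps the iterate in the row span of X throughout training, which is why the limit is exactly the minimum-norm interpolator; the paper's contribution is characterizing how close ReLU networks get to this behavior.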
- Proves gradient descent on ReLU networks approximates minimum-l2-norm solutions with high probability for high-dimensional random data
- Quantifies the approximation gap as Θ(√(n/d)), where n is the number of samples and d is the feature dimension
- Uses novel primal-dual analysis tracking prediction evolution and activation pattern stabilization
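Activation-pattern stabilization can also be observed in a toy experiment: train a small one-hidden-layer ReLU network with full-batch GD and count how many ReLU signs flip at each step. The sketch below freezes the output layer at random signs, which is a common simplification assumed here for illustration, not the paper's exact setting; all sizes and the step size are likewise illustrative. Flips concentrate in the early iterations, after which the pattern freezes and the dynamics become effectively linear:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 8, 30, 50                  # samples, input dim, hidden width
X = rng.standard_normal((n, d)) / np.sqrt(d)      # rows have ~unit norm
y = rng.standard_normal(n)

W = rng.standard_normal((m, d)) / np.sqrt(d)      # trainable hidden layer
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # frozen output layer

def loss(W):
    return 0.5 * np.sum((np.maximum(W @ X.T, 0.0).T @ a - y) ** 2)

lr, T = 0.2, 2000
init_loss = loss(W)
prev_pattern = (W @ X.T > 0)
flips = []                            # activation-sign changes per GD step
for _ in range(T):
    pre = W @ X.T                     # (m, n) preactivations
    act = pre > 0                     # current ReLU activation pattern
    resid = np.maximum(pre, 0.0).T @ a - y             # (n,) residuals
    W -= lr * (a[:, None] * act * resid[None, :]) @ X  # GD step on W
    pattern = (W @ X.T > 0)
    flips.append(int((pattern != prev_pattern).sum()))
    prev_pattern = pattern
final_loss = loss(W)

print(sum(flips[:100]), sum(flips[-100:]))  # early flips vs. late flips
```

Once no signs flip, each neuron applies a fixed linear map to every training point, which is the mechanism the paper's analysis exploits.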
Why It Matters
Provides a mathematical foundation for why overparameterized neural networks generalize well, informing better model design and training practices.