ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit
New math proves ResNets and Transformers converge predictably as they scale to billions of parameters.
A team of researchers including Louis-Pierre Chaintron, Lénaïc Chizat, and Javier Maas has published a significant theoretical paper establishing rigorous convergence guarantees for the training dynamics of large-scale residual neural networks (ResNets). The work, titled 'ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit,' proves that as a network's depth (L), hidden width (M), and embedding dimension (D) simultaneously grow to infinity, its training behavior converges to a predictable limit. The analysis specifically considers ResNets with two-layer perceptron blocks in the maximal local feature update (MLU) regime.
Crucially, the team quantified the error between a finite ResNet and its theoretical infinite-scale limit. They proved this error is bounded by O(1/L + sqrt(D/(L M)) + 1/sqrt(D)). For a total parameter budget P scaling as Theta(L M D), this yields an overall convergence rate of O(P^(-1/6)) when the scaling of L, M, and D is optimized to minimize the bound. The analysis leverages the depth-two structure of residual blocks and combines advanced probabilistic methods like the cavity method and propagation of chaos arguments on 'skeleton maps.'
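The O(P^(-1/6)) rate can be recovered by a back-of-envelope balancing of the three stated error terms (this is a sanity check on the summary's numbers, not the paper's actual proof):

```latex
% Equate the three terms of the bound 1/L + \sqrt{D/(LM)} + 1/\sqrt{D}
% under the parameter budget P = \Theta(LMD).
\frac{1}{L} = \frac{1}{\sqrt{D}} \;\Rightarrow\; D = L^{2},
\qquad
\sqrt{\frac{D}{LM}} = \sqrt{\frac{L^{2}}{LM}} = \frac{1}{L} \;\Rightarrow\; M = L^{3}.
% The budget then fixes L in terms of P:
\qquad
P = L \cdot M \cdot D = L^{6} \;\Rightarrow\; L = P^{1/6},
\quad
\text{so each term is } \; \frac{1}{L} = P^{-1/6},
\;\text{ giving the } O\!\left(P^{-1/6}\right) \text{ rate.}
```

In other words, the bound is minimized when depth, width, and embedding dimension grow as L = Θ(P^(1/6)), M = Θ(P^(1/2)), D = Θ(P^(1/3)).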
The paper's impact extends beyond ResNets. The authors state the framework applies formally to a broad class of state-of-the-art architectures, including Transformers with bounded key-query dimensions. This work completes a program initiated in a companion paper, bridging the gap between fixed-embedding-dimension dynamics and the large-D limit. It represents one of the first rigorous quantitative convergence proofs for a DMFT-type (Dynamic Mean Field Theory) limit in machine learning, moving the field from empirical observation toward rigorous mathematical guarantees for scaling behavior.
- Proves ResNet training dynamics converge at rate O(P^(-1/6)) for parameter budget P, providing a mathematical scaling law.
- Error bound of O(1/L + sqrt(D/(L M)) + 1/sqrt(D)) links network depth (L), width (M), and embedding size (D).
- Formally applies to modern architectures like Transformers, offering a theoretical foundation for predicting large-model training.
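The error bound above can be evaluated numerically. The sketch below (an illustration of the stated formula with constants dropped, not code from the paper) plugs in the balanced scaling L = P^(1/6), M = P^(1/2), D = P^(1/3) and confirms that all three terms match, so the bound decays like P^(-1/6):

```python
import math

def error_bound(L, M, D):
    """The paper's stated bound on the gap between a finite ResNet and
    its infinite-scale limit, with constants dropped:
    1/L + sqrt(D/(L*M)) + 1/sqrt(D)."""
    return 1.0 / L + math.sqrt(D / (L * M)) + 1.0 / math.sqrt(D)

def optimized_shape(P):
    """Balanced scaling L = P^(1/6), M = P^(1/2), D = P^(1/3),
    which equates the three error terms while keeping L*M*D = P."""
    return P ** (1 / 6), P ** (1 / 2), P ** (1 / 3)

# Under the balanced scaling, each term equals P^(-1/6),
# so the bound is 3 * P^(-1/6) and shrinks slowly with the budget.
for P in (1e6, 1e9, 1e12):
    L, M, D = optimized_shape(P)
    print(f"P={P:.0e}: bound={error_bound(L, M, D):.4f}, "
          f"3*P^(-1/6)={3 * P ** (-1 / 6):.4f}")
```

For P = 10^6 this gives L = 10, M = 1000, D = 100, and each of the three terms equals 0.1; the slow P^(-1/6) decay is why the optimal shape skews toward width (M grows fastest) rather than depth.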
Why It Matters
Provides a rigorous mathematical basis for predicting how the training dynamics of AI models like GPT and Llama will behave as they scale to trillions of parameters.