ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit
New math proves ResNets and Transformers converge predictably as they scale to billions of parameters.
A team of researchers including Louis-Pierre Chaintron, Lénaïc Chizat, and Javier Maas has published a significant theoretical paper establishing rigorous convergence guarantees for the training dynamics of large-scale residual neural networks (ResNets). The work, titled 'ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit,' proves that as a network's depth (L), hidden width (M), and embedding dimension (D) simultaneously grow to infinity, its training behavior converges to a predictable limit. The analysis specifically considers ResNets with two-layer perceptron blocks in the maximal local feature update (MLU) regime.
Crucially, the team quantified the error between a finite ResNet and its theoretical infinite-scale limit. They proved this error is bounded by O(1/L + sqrt(D/(L M)) + 1/sqrt(D)). For a total parameter budget P scaling as Theta(L M D), this yields an overall convergence rate of O(P^(-1/6)) when the scaling of L, M, and D is optimized to minimize the bound. The analysis leverages the depth-two structure of residual blocks and combines advanced probabilistic methods like the cavity method and propagation of chaos arguments on 'skeleton maps.'
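The O(P^(-1/6)) rate can be recovered by a back-of-envelope balancing of the three stated error terms (this is a sanity check on the summary's numbers, not the paper's actual proof):

```latex
% Equate the three terms of the bound 1/L + \sqrt{D/(LM)} + 1/\sqrt{D}
% under the parameter budget P = \Theta(LMD).
\frac{1}{L} = \frac{1}{\sqrt{D}} \;\Rightarrow\; D = L^{2},
\qquad
\sqrt{\frac{D}{LM}} = \sqrt{\frac{L^{2}}{LM}} = \frac{1}{L} \;\Rightarrow\; M = L^{3}.
% The budget then fixes L in terms of P:
\qquad
P = L \cdot M \cdot D = L^{6} \;\Rightarrow\; L = P^{1/6},
\quad
\text{so each term is } \; \frac{1}{L} = P^{-1/6},
\;\text{ giving the } O\!\left(P^{-1/6}\right) \text{ rate.}
```

In other words, the bound is minimized when depth, width, and embedding dimension grow as L = Θ(P^(1/6)), M = Θ(P^(1/2)), D = Θ(P^(1/3)).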
The paper's impact extends beyond ResNets. The authors state the framework applies formally to a broad class of state-of-the-art architectures, including Transformers with bounded key-query dimensions. This work completes a program initiated in a companion paper, bridging the gap between fixed-embedding-dimension dynamics and the large-D limit. It represents one of the first rigorous quantitative convergence proofs for a DMFT-type (Dynamic Mean Field Theory) limit in machine learning, moving the field from empirical observation toward rigorous mathematical guarantees for scaling behavior.
- Proves ResNet training dynamics converge at rate O(P^(-1/6)) for parameter budget P, providing a mathematical scaling law.
- Error bound of O(1/L + sqrt(D/(L M)) + 1/sqrt(D)) links network depth (L), width (M), and embedding size (D).
- Formally applies to modern architectures like Transformers, offering a theoretical foundation for predicting large-model training.
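The error bound above can be evaluated numerically. The sketch below (an illustration of the stated formula with constants dropped, not code from the paper) plugs in the balanced scaling L = P^(1/6), M = P^(1/2), D = P^(1/3) and confirms that all three terms match, so the bound decays like P^(-1/6):

```python
import math

def error_bound(L, M, D):
    """The paper's stated bound on the gap between a finite ResNet and
    its infinite-scale limit, with constants dropped:
    1/L + sqrt(D/(L*M)) + 1/sqrt(D)."""
    return 1.0 / L + math.sqrt(D / (L * M)) + 1.0 / math.sqrt(D)

def optimized_shape(P):
    """Balanced scaling L = P^(1/6), M = P^(1/2), D = P^(1/3),
    which equates the three error terms while keeping L*M*D = P."""
    return P ** (1 / 6), P ** (1 / 2), P ** (1 / 3)

# Under the balanced scaling, each term equals P^(-1/6),
# so the bound is 3 * P^(-1/6) and shrinks slowly with the budget.
for P in (1e6, 1e9, 1e12):
    L, M, D = optimized_shape(P)
    print(f"P={P:.0e}: bound={error_bound(L, M, D):.4f}, "
          f"3*P^(-1/6)={3 * P ** (-1 / 6):.4f}")
```

For P = 10^6 this gives L = 10, M = 1000, D = 100, and each of the three terms equals 0.1; the slow P^(-1/6) decay is why the optimal shape skews toward width (M grows fastest) rather than depth.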
Why It Matters
Provides a rigorous mathematical basis for predicting how the training dynamics of AI models like GPT and Llama will behave as they scale to trillions of parameters.