New theory proves deep transformers converge to ODE dynamics
Researchers show that AdamW-trained transformers converge to a continuous-depth limit, with error bounds that shrink as depth and the number of heads grow.
A new paper from William Gibson and Christoph Reisinger tackles a fundamental question in deep learning: how do transformers behave as we scale them up? By modeling the hidden states as an interacting particle system coupled through attention, the authors prove that under AdamW training, the joint dynamics of hidden states and backpropagated variables converge uniformly to a system of forward-backward ordinary differential equations (ODEs). The convergence rate is O(L^{-1} + L^{-1/3} H^{-1/2}), where L is the depth and H is the number of attention heads. Notably, the bounds are independent of the number of tokens and, with a slight modification of AdamW, also independent of the embedding dimension.
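To get a feel for the rate, the sketch below evaluates L^{-1} + L^{-1/3} H^{-1/2} for a few depth and head counts. The function name `rate_bound`, the sample configurations, and the choice to set the hidden constant to 1 are assumptions made for illustration. The paper's actual constants are not reproduced here, so only the relative trends are meaningful.

```python
def rate_bound(depth: int, heads: int) -> float:
    """Return L^{-1} + L^{-1/3} * H^{-1/2}, with the hidden constant assumed to be 1."""
    if depth < 1 or heads < 1:
        raise ValueError("depth and heads must be positive")
    return depth ** -1.0 + depth ** (-1.0 / 3.0) * heads ** -0.5


if __name__ == "__main__":
    # Hypothetical (depth, heads) configurations, chosen only for illustration.
    for depth, heads in [(12, 12), (24, 16), (48, 32), (96, 64)]:
        print(f"L={depth:3d}  H={heads:3d}  bound ~ {rate_bound(depth, heads):.4f}")
```

The second term dominates the first whenever H < L^{4/3}, which covers most common configurations. Doubling the depth shrinks that term by only about 21% (a factor of 2^{-1/3}), while quadrupling the number of heads halves it.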
The authors also show that, when no causal masking is used, the limiting ODE system reduces to a McKean–Vlasov ODE, in which each token's evolution depends on the distribution of all tokens. On the technical side, the proof uses flow maps and concentration of measure in place of covering arguments, which yields cleaner constants. The result supports practitioners' intuition that deeper transformers can be trained predictably, and it provides a rigorous framework for understanding how attention drives collective behavior among tokens. For engineers designing next-generation LLMs, it suggests that scaling depth and head count may behave more predictably than previously assumed, which could reduce trial and error in architecture search.
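The particle-system picture can be made concrete with a small numerical sketch. The code below treats tokens as particles, uses unmasked multi-head softmax attention as the interaction, and reads each residual layer as one Euler step of size 1/L. Everything specific here is an assumption for illustration rather than the paper's setup: the weights are random, untrained, and shared across layers; there is no AdamW, MLP block, or layer normalization; and the names `attention_drift`, `run_depth`, and `depth_errors` are hypothetical. It illustrates only the depth-discretization side of the result, namely that the output stabilizes as L grows.

```python
import numpy as np


def softmax_rows(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over each row."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)


def attention_drift(x: np.ndarray, q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Average of unmasked softmax-attention outputs over heads.

    x has shape (n_tokens, dim); q, k, v each have shape (heads, dim, dim).
    """
    heads, dim = q.shape[0], x.shape[1]
    drift = np.zeros_like(x)
    for h in range(heads):
        scores = (x @ q[h]) @ (x @ k[h]).T / np.sqrt(dim)
        drift += softmax_rows(scores) @ (x @ v[h])
    return drift / heads


def run_depth(x0: np.ndarray, q: np.ndarray, k: np.ndarray, v: np.ndarray, depth: int) -> np.ndarray:
    """Apply `depth` residual attention layers, each an Euler step of size 1/depth."""
    x = x0.copy()
    for _ in range(depth):
        x = x + attention_drift(x, q, k, v) / depth
    return x


def depth_errors(depths=(8, 32, 128), reference_depth=2048, seed=0) -> dict:
    """Max-norm gap between shallow networks and a very deep reference network."""
    rng = np.random.default_rng(seed)
    n_tokens, dim, heads = 16, 8, 4  # illustrative sizes, not taken from the paper
    x0 = rng.normal(size=(n_tokens, dim))
    # Assumption: layer-independent random weights, scaled by 1/sqrt(dim).
    q, k, v = (rng.normal(size=(heads, dim, dim)) / np.sqrt(dim) for _ in range(3))
    reference = run_depth(x0, q, k, v, reference_depth)
    return {
        depth: float(np.abs(run_depth(x0, q, k, v, depth) - reference).max())
        for depth in depths
    }


if __name__ == "__main__":
    for depth, err in depth_errors().items():
        print(f"L={depth:4d}  max deviation from deep reference: {err:.2e}")
```

In this toy setting the gap to the deep reference shrinks as L increases, which is the behavior one expects from an Euler discretization of a smooth ODE. The H-dependent term in the paper's rate is not captured by this sketch.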
- Proves convergence of transformer dynamics to ODEs at rate O(L^{-1} + L^{-1/3} H^{-1/2})
- Error bounds are independent of token count and, with a slight modification of AdamW, of embedding dimension
- Identifies the limiting system as a McKean–Vlasov ODE when no causal masking is used
Why It Matters
The result gives theoretical guarantees for how deep, multi-head transformers behave as they scale, which may support more predictable training and architecture design.