Weight decay controls grokking transitions in transformers, new paper shows
A single scalar parameter governs memorization, generalization, and collapse in transformer training on modular arithmetic.
A new arXiv paper by Lucky Verma investigates how weight decay regulates the sharp transitions between memorization, generalization (grokking), and collapse in transformers trained on modular arithmetic. The author introduces two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, which are computed from attention activations alone and track training dynamics at lower compute cost than traditional loss-landscape methods. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), weight decay acts as a scalar empirical control parameter.
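To make the two diagnostics concrete, here is a minimal NumPy sketch of how they might be computed from one layer's attention probabilities. The summary does not give the paper's exact definitions, so the choices below are assumptions: each head's attention map is flattened before the cosine similarity is taken, entropy is averaged over query positions, and the standard deviation is taken across heads. The function names are hypothetical.

```python
import numpy as np

def mean_pairwise_head_cosine(attn):
    """Mean cosine similarity over all distinct pairs of heads.

    attn: array of shape (H, T, T) holding attention probabilities for one
    layer and one example. Requires H >= 2. Assumed definition: each head's
    (T, T) map is flattened into a single vector before comparison.
    """
    num_heads = attn.shape[0]
    flat = attn.reshape(num_heads, -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T                        # (H, H) cosine matrix
    upper = np.triu_indices(num_heads, k=1)    # distinct pairs only
    return float(sim[upper].mean())

def entropy_std(attn):
    """Standard deviation across heads of the mean attention entropy.

    Assumed definition: Shannon entropy (in nats) of each query row,
    averaged over query positions per head, then np.std across heads.
    """
    p = np.clip(attn, 1e-12, 1.0)
    row_entropy = -(p * np.log(p)).sum(axis=-1)   # (H, T)
    head_entropy = row_entropy.mean(axis=-1)      # (H,)
    return float(head_entropy.std())

# Illustrative input: random attention for 4 heads over 8 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8, 8))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(mean_pairwise_head_cosine(attn), entropy_std(attn))
```

Both quantities need only a forward pass and the attention tensor, which is consistent with the article's point that they are cheaper than loss-landscape analysis. In practice they would be logged per layer at regular training steps.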
The study localizes the boundary between memorization and developmental grokking at λc=0.0158 (95% CI [0.0109, 0.0200], N=210) using a logistic fit, and reports an empirical power-law exponent ν=0.757 (CI [0.725, 0.799]), which excludes reference exponents from mean-field (ν=0.5) and 3D Ising (ν≈0.63) universality classes under the four-bin grid used. A horizon-matched multi-task replication (n=280, four modular operations) and cross-architecture probes (4L MLP, 4L LSTM, 4L Mamba; each n=70) confirm the weight-decay-controlled transition with architecture-specific λc values. The author limits the scope to modular arithmetic in small attention models and makes no broader claims about language models or universality classes.
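The idea of locating a critical weight decay with a logistic fit can be illustrated as follows. This is a sketch under stated assumptions, not the paper's procedure: it assumes each run is labeled with a binary outcome (grokked or not), fits the probability of grokking as a logistic function of log10(λ) using Newton's method, and reports the λ at which the fitted probability crosses 0.5. The data are synthetic, with a made-up true boundary of 0.01, and the function name is hypothetical.

```python
import numpy as np

def fit_logistic_boundary(weight_decay, grokked, iters=25, ridge=1e-6):
    """Fit P(grok) = sigmoid(b0 + b1 * (log10(wd) - mean)) by Newton's
    method and return the weight decay where the fitted probability is 0.5.
    """
    x = np.log10(np.asarray(weight_decay, dtype=float))
    y = np.asarray(grokked, dtype=float)
    x_mean = x.mean()
    # Centering the feature keeps the Newton steps well conditioned.
    X = np.column_stack([np.ones_like(x), x - x_mean])
    beta = np.zeros(2)
    for _ in range(iters):
        z = np.clip(X @ beta, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-z))
        grad = X.T @ (y - p)                       # log-likelihood gradient
        w = p * (1.0 - p)
        # Tiny ridge term added for numerical stability only.
        hess = (X * w[:, None]).T @ X + ridge * np.eye(2)
        beta = beta + np.linalg.solve(hess, grad)
    log_boundary = x_mean - beta[0] / beta[1]      # where b0 + b1 * xc = 0
    return 10.0 ** log_boundary

# Synthetic runs (NOT the paper's data): true boundary set to 0.01.
rng = np.random.default_rng(0)
true_boundary = 1e-2
wd = 10.0 ** rng.uniform(-3.0, -1.0, size=2000)
p_true = 1.0 / (1.0 + np.exp(-6.0 * (np.log10(wd) - np.log10(true_boundary))))
outcome = rng.random(2000) < p_true
estimate = fit_logistic_boundary(wd, outcome)
print(f"estimated boundary: {estimate:.4f}")
```

A confidence interval like the one quoted above could be obtained by refitting on bootstrap resamples of the runs, though the summary does not say which interval method the paper used.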
- Weight decay acts as a scalar control parameter separating memorization, developmental grokking, and collapse across 11 conditions and model sizes from 0.82M to 85M parameters.
- Two cheap diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, track phase transitions from attention activations alone, complementing loss-landscape methods at lower compute cost.
- The boundary between memorization and developmental grokking was localized at λc=0.0158 (95% CI [0.0109, 0.0200]), and the weight-decay-controlled transition also appeared in MLP, LSTM, and Mamba architectures with architecture-specific critical values.
Why It Matters
The paper offers practical, low-cost diagnostics for monitoring grokking in small transformers trained on modular arithmetic. Whether they carry over to larger reasoning models is outside the scope the author claims.