Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization

Researchers prove weight decay makes Transformer loss landscapes 'Villani' – enabling faster optimization and better generalization.

Deep Dive

Weight decay is a cornerstone regularizer in large language models, yet its theoretical role in shaping Transformer loss landscapes has remained poorly understood. In a new paper, researchers Abhijit Das and Sayantan Dutta provide the first rigorous functional-analytic characterization: they prove that the standard cross-entropy loss with L2 regularization satisfies Villani's criteria for coercive energy functions. This means the loss landscape is infinitely differentiable, grows at least quadratically, has Gaussian-integrable tails, and satisfies a differential growth condition. From this structure, the authors derive explicit log-Sobolev and Poincaré constants, with C_LS ≤ λ⁻¹ + d/λ², directly linking the regularization strength λ and model dimension d to finite-time convergence guarantees for noisy stochastic gradient descent, as well as to PAC-Bayesian generalization bounds that tighten as λ increases.
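To make the scaling concrete, the following is a minimal Python sketch (an illustration of the stated inequality, not code from the paper) that evaluates the bound C_LS ≤ λ⁻¹ + d/λ² at the scale of GPT-Neo-125M; under a log-Sobolev inequality, Langevin-type dynamics contract at a rate on the order of 1/C_LS, up to constants and normalization conventions:

    def log_sobolev_bound(lam: float, d: int) -> float:
        """Upper bound on the log-Sobolev constant from the paper: 1/lam + d/lam**2."""
        return 1.0 / lam + d / lam**2

    # Stronger weight decay tightens the bound and speeds up the implied mixing.
    for lam in (0.01, 0.1, 1.0):
        c_ls = log_sobolev_bound(lam, d=125_000_000)  # parameter count of GPT-Neo-125M
        print(f"lambda={lam}: C_LS <= {c_ls:.3e}, implied rate ~ {1 / c_ls:.3e}")

As the loop shows, increasing λ shrinks both terms of the bound, which is the sense in which both the convergence guarantees and the PAC-Bayesian bounds tighten.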

To validate their theory, Das and Dutta introduce a scalable diagnostic, Ψ_s(θ) = -ΔF + s⁻¹||∇F||², estimated efficiently using Hutchinson trace probes on models with over 100 million parameters. Experiments on GPT-Neo-125M across Penn Treebank and WikiText-103 confirm the predicted quadratic growth of Ψ_s, spectral inflation of the Hessian, and exponential convergence behavior consistent with their log-Sobolev analysis. These results show that weight decay does more than empirically improve generalization—it establishes the mathematical conditions required for fast Langevin mixing and theoretically grounded curvature-aware optimization in deep learning, offering a new foundation for understanding and accelerating Transformer training.
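The diagnostic is straightforward to approximate with automatic differentiation. Below is a minimal PyTorch sketch, not the authors' implementation: the Laplacian ΔF (the trace of the Hessian) is estimated with Hutchinson probes, averaging vᵀHv over random Rademacher vectors v, where Hv comes from a Hessian-vector product. The function names and the toy quadratic-plus-weight-decay loss are hypothetical, chosen only to keep the example self-contained:

    import torch

    def villani_diagnostic(loss_fn, theta, s=1.0, num_probes=8):
        """Hutchinson estimate of Psi_s(theta) = -Lap(F) + (1/s) * ||grad F||^2."""
        loss = loss_fn(theta)
        (grad,) = torch.autograd.grad(loss, theta, create_graph=True)
        grad_norm_sq = grad.detach().pow(2).sum()

        trace_est = torch.zeros(())
        for _ in range(num_probes):
            # Rademacher probe: entries +1 or -1 with equal probability.
            v = torch.randint(0, 2, theta.shape, dtype=theta.dtype) * 2 - 1
            # Hessian-vector product via a second backward pass.
            (hv,) = torch.autograd.grad(grad, theta, grad_outputs=v, retain_graph=True)
            trace_est = trace_est + (v * hv).sum()
        trace_est = trace_est / num_probes  # E[v^T H v] = tr(H)

        return -trace_est + grad_norm_sq / s

    # Toy example: quadratic data term plus L2 weight decay.
    d, lam = 16, 0.1
    A = torch.randn(d, d)
    A = A @ A.t() / d  # a positive semi-definite "data" Hessian
    theta = torch.randn(d, requires_grad=True)
    loss_fn = lambda p: 0.5 * p @ A @ p + 0.5 * lam * p.pow(2).sum()
    print(float(villani_diagnostic(loss_fn, theta)))

Because each probe costs only one extra backward pass, this style of estimator scales to models in the 100M-parameter range, which is what makes diagnostics at GPT-Neo-125M scale feasible.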

Key Points
  • First rigorous proof that weight decay makes Transformer loss landscapes satisfy Villani's criteria for coercive energy functions, enabling fast optimization.
  • Derived an explicit log-Sobolev constant, C_LS ≤ λ⁻¹ + d/λ², linking regularization strength λ and model dimension d to convergence guarantees and generalization bounds (see the noisy-SGD sketch after this list).
  • Validated the theory on GPT-Neo-125M (125M parameters) using a novel Villani diagnostic, confirming quadratic growth of Ψ_s, Hessian spectral inflation, and exponential convergence.
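For context on the dynamics behind those guarantees, here is a minimal sketch of noisy SGD viewed as unadjusted Langevin dynamics with explicit weight decay; the λ-term in the drift is the L2 regularization the paper analyzes, while the step size, temperature, and update rule here are generic placeholders rather than the authors' experimental settings:

    import torch

    def noisy_sgd_step(theta, grad_loss, lam, lr, temperature=1.0):
        """One unadjusted Langevin step with L2 weight decay:
        theta <- theta - lr * (grad_loss + lam * theta) + sqrt(2 * lr * T) * xi.
        """
        noise = torch.randn_like(theta)
        drift = grad_loss + lam * theta  # gradient of the loss plus the weight-decay term
        return theta - lr * drift + (2.0 * lr * temperature) ** 0.5 * noise

Under the paper's log-Sobolev bound, such a chain mixes exponentially fast toward its stationary distribution, at a rate that improves as λ grows.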

Why It Matters

Provides a rigorous theoretical foundation for weight decay, explaining why it enables faster training and better generalization in large language models.