Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization

Researchers prove weight decay makes Transformer loss landscapes 'Villani' – enabling faster optimization and better generalization.

Deep Dive

Weight decay is a cornerstone regularizer in large language models, yet its theoretical role in shaping Transformer loss landscapes has remained poorly understood. In a new paper, researchers Abhijit Das and Sayantan Dutta provide the first rigorous functional-analytic characterization: they prove that the standard cross-entropy loss with L2 regularization satisfies Villani's criteria for coercive energy functions. This means the loss landscape is infinitely differentiable, grows at least quadratically, has Gaussian-integrable tails, and satisfies a differential growth condition. From this structure, the authors derive explicit log-Sobolev and Poincaré constants, with C_LS ≤ λ⁻¹ + d/λ², directly linking the regularization strength λ and model dimension d to finite-time convergence guarantees for noisy stochastic gradient descent, as well as to PAC-Bayesian generalization bounds that tighten as λ increases.
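To make the scaling concrete, the following is a minimal Python sketch (an illustration of the stated inequality, not code from the paper) that evaluates the bound C_LS ≤ λ⁻¹ + d/λ² at the scale of GPT-Neo-125M; under a log-Sobolev inequality, Langevin-type dynamics contract at a rate on the order of 1/C_LS, up to constants and normalization conventions:

    def log_sobolev_bound(lam: float, d: int) -> float:
        """Upper bound on the log-Sobolev constant from the paper: 1/lam + d/lam**2."""
        return 1.0 / lam + d / lam**2

    # Stronger weight decay tightens the bound and speeds up the implied mixing.
    for lam in (0.01, 0.1, 1.0):
        c_ls = log_sobolev_bound(lam, d=125_000_000)  # parameter count of GPT-Neo-125M
        print(f"lambda={lam}: C_LS <= {c_ls:.3e}, implied rate ~ {1 / c_ls:.3e}")

As the loop shows, increasing λ shrinks both terms of the bound, which is the sense in which both the convergence guarantees and the PAC-Bayesian bounds tighten.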

To validate their theory, Das and Dutta introduce a scalable diagnostic, Ψ_s(θ) = -ΔF + s⁻¹||∇F||², estimated efficiently using Hutchinson trace probes on models with over 100 million parameters. Experiments on GPT-Neo-125M across Penn Treebank and WikiText-103 confirm the predicted quadratic growth of Ψ_s, spectral inflation of the Hessian, and exponential convergence behavior consistent with their log-Sobolev analysis. These results show that weight decay does more than empirically improve generalization—it establishes the mathematical conditions required for fast Langevin mixing and theoretically grounded curvature-aware optimization in deep learning, offering a new foundation for understanding and accelerating Transformer training.
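The diagnostic is straightforward to approximate with automatic differentiation. Below is a minimal PyTorch sketch, not the authors' implementation: the Laplacian ΔF (the trace of the Hessian) is estimated with Hutchinson probes, averaging vᵀHv over random Rademacher vectors v, where Hv comes from a Hessian-vector product. The function names and the toy quadratic-plus-weight-decay loss are hypothetical, chosen only to keep the example self-contained:

    import torch

    def villani_diagnostic(loss_fn, theta, s=1.0, num_probes=8):
        """Hutchinson estimate of Psi_s(theta) = -Lap(F) + (1/s) * ||grad F||^2."""
        loss = loss_fn(theta)
        (grad,) = torch.autograd.grad(loss, theta, create_graph=True)
        grad_norm_sq = grad.detach().pow(2).sum()

        trace_est = torch.zeros(())
        for _ in range(num_probes):
            # Rademacher probe: entries +1 or -1 with equal probability.
            v = torch.randint(0, 2, theta.shape, dtype=theta.dtype) * 2 - 1
            # Hessian-vector product via a second backward pass.
            (hv,) = torch.autograd.grad(grad, theta, grad_outputs=v, retain_graph=True)
            trace_est = trace_est + (v * hv).sum()
        trace_est = trace_est / num_probes  # E[v^T H v] = tr(H)

        return -trace_est + grad_norm_sq / s

    # Toy example: quadratic data term plus L2 weight decay.
    d, lam = 16, 0.1
    A = torch.randn(d, d)
    A = A @ A.t() / d  # a positive semi-definite "data" Hessian
    theta = torch.randn(d, requires_grad=True)
    loss_fn = lambda p: 0.5 * p @ A @ p + 0.5 * lam * p.pow(2).sum()
    print(float(villani_diagnostic(loss_fn, theta)))

Because each probe costs only one extra backward pass, this style of estimator scales to models in the 100M-parameter range, which is what makes diagnostics at GPT-Neo-125M scale feasible.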

Key Points
  • First rigorous proof that weight decay makes Transformer loss landscapes satisfy Villani's criteria for coercive energy functions, enabling fast optimization.
  • Derived an explicit log-Sobolev constant, C_LS ≤ λ⁻¹ + d/λ², linking regularization strength λ and model dimension d to convergence guarantees and generalization bounds (see the noisy-SGD sketch after this list).
  • Validated the theory on GPT-Neo-125M (125M parameters) using a novel Villani diagnostic, confirming quadratic growth of Ψ_s, Hessian spectral inflation, and exponential convergence.
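For context on the dynamics behind those guarantees, here is a minimal sketch of noisy SGD viewed as unadjusted Langevin dynamics with explicit weight decay; the λ-term in the drift is the L2 regularization the paper analyzes, while the step size, temperature, and update rule here are generic placeholders rather than the authors' experimental settings:

    import torch

    def noisy_sgd_step(theta, grad_loss, lam, lr, temperature=1.0):
        """One unadjusted Langevin step with L2 weight decay:
        theta <- theta - lr * (grad_loss + lam * theta) + sqrt(2 * lr * T) * xi.
        """
        noise = torch.randn_like(theta)
        drift = grad_loss + lam * theta  # gradient of the loss plus the weight-decay term
        return theta - lr * drift + (2.0 * lr * temperature) ** 0.5 * noise

Under the paper's log-Sobolev bound, such a chain mixes exponentially fast toward its stationary distribution, at a rate that improves as λ grows.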

Why It Matters

Provides a rigorous theoretical foundation for weight decay, explaining why it enables faster training and better generalization in large language models.