New paper explains why neural network curvature varies by layer type
Spectral Alignment Decomposition reveals why convolutions have α≈2, transformers α≈1...
A new arXiv preprint (arXiv:2606.02596) by Anherutowa Calvo gives an exact decomposition of the curvature exponent α in neural network loss landscapes. The paper proves the Spectral Alignment Decomposition, α = 2 + d log Φ_k / d log σ_k, where Φ_k measures the alignment between Kronecker-factor eigenbases and gradient singular directions. This explains why α varies systematically by layer type: α≈2 for convolutions, α≈1 for transformer attention, and α<1 for MLP up-projections. The decomposition implies a spectral transfer identity, s = αγ, linking the curvature exponent α, the effective gradient rank-decay exponent γ, and the Hessian decay exponent s. Empirically, fitting α and γ on independent data (Hessian-vector products vs. SVD) recovers s to about 2% median error across 93 layers, five architectures, and three datasets, with no free parameters.
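To see why an identity of the form s = αγ is plausible, here is a minimal sketch on synthetic data. It assumes, purely for illustration, that gradient singular values decay with rank as σ_k ∝ k^(-γ) and that per-direction curvature scales as h_k ∝ σ_k^α, so that h_k ∝ k^(-αγ). These scaling forms, the function names, and the noise model are assumptions of this example and are not taken from the paper, which fits α and γ from separate measurements on real layers.

```python
import math
import random

def fit_power_law_exponent(x, y):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    var = sum((a - mx) ** 2 for a in lx)
    return cov / var

def synthetic_layer(alpha, gamma, n=64, noise=0.02, seed=0):
    """Hypothetical layer: sigma_k ~ k^-gamma, curvature h_k ~ sigma_k^alpha."""
    rng = random.Random(seed)
    ranks = list(range(1, n + 1))
    # Multiplicative log-normal noise stands in for measurement error.
    sigma = [k ** (-gamma) * math.exp(rng.gauss(0.0, noise)) for k in ranks]
    curvature = [s ** alpha * math.exp(rng.gauss(0.0, noise)) for s in sigma]
    return ranks, sigma, curvature

def check_transfer_identity(alpha_true=2.0, gamma_true=0.8):
    ranks, sigma, curvature = synthetic_layer(alpha_true, gamma_true)
    alpha = fit_power_law_exponent(sigma, curvature)   # curvature vs. singular value
    gamma = -fit_power_law_exponent(ranks, sigma)      # singular value vs. rank
    s = -fit_power_law_exponent(ranks, curvature)      # curvature vs. rank
    return alpha, gamma, s

if __name__ == "__main__":
    alpha, gamma, s = check_transfer_identity()
    rel_err = abs(alpha * gamma - s) / s
    print(f"alpha={alpha:.3f} gamma={gamma:.3f} s={s:.3f} rel_err={rel_err:.2%}")
```

Under these assumptions the product of the two fitted exponents should land close to the directly fitted s. On real layers the paper's claim is stronger, since α and γ there come from independent data.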
As a proof of concept, the author derives an architecture-adaptive preconditioner T(σ;α) and shows that Spectral Newton, which applies T in the gradient singular basis, outperforms AdamW on vision benchmarks where α≈2. The paper also includes a zeta-function bound on the participation ratio, showing that curvature concentrates onto effectively one direction per layer. Together, these results offer both a theoretical account of why layer types differ in curvature and a practical optimizer that uses it to improve training efficiency.
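The link between a zeta function and the participation ratio can be illustrated with a small sketch. It assumes the standard definition PR = (Σλ_k)² / Σλ_k² and a hypothetical pure power-law spectrum λ_k = k^(-s). In that case the sums become ζ(s) and ζ(2s) as the number of directions grows, so PR approaches ζ(s)²/ζ(2s). Whether this matches the exact form of the paper's bound is not stated in this summary, and the helper names below are illustrative only.

```python
import math

def participation_ratio(eigenvalues):
    """PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    total = sum(eigenvalues)
    return total * total / sum(e * e for e in eigenvalues)

def power_law_spectrum(s, n):
    """Hypothetical layer spectrum: lambda_k = k^(-s) for k = 1..n."""
    return [k ** (-s) for k in range(1, n + 1)]

# Closed-form large-n limits zeta(s)^2 / zeta(2s), using the known values
# zeta(2) = pi^2/6, zeta(4) = pi^4/90, zeta(8) = pi^8/9450.
LIMIT_S2 = (math.pi ** 2 / 6) ** 2 / (math.pi ** 4 / 90)      # equals 5/2
LIMIT_S4 = (math.pi ** 4 / 90) ** 2 / (math.pi ** 8 / 9450)   # equals 7/6

if __name__ == "__main__":
    for s, limit in [(2.0, LIMIT_S2), (4.0, LIMIT_S4)]:
        pr = participation_ratio(power_law_spectrum(s, 10_000))
        print(f"s={s}: PR={pr:.4f}, zeta limit={limit:.4f}")
```

In this toy model the participation ratio stays small even with thousands of directions and moves toward 1 as s increases, which is consistent with the summary's statement that curvature concentrates onto roughly one direction per layer.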
- Spectral Alignment Decomposition: α = 2 + dlogΦ_k / dlogσ_k, explaining curvature variation across layer types
- Spectral transfer identity s = αγ validated with ~2% median error across 93 layers, 5 architectures, 3 datasets
- New Spectral Newton optimizer outperforms AdamW on vision benchmarks where α≈2
Why It Matters
Provides a theoretical foundation for layer-specific curvature and a basis for architecture-adaptive optimizers, which in the reported experiments outperform AdamW on vision benchmarks where α≈2.