New paper proves flatness correlates with generalization despite network symmetries
A decade-long debate over whether flat minima generalize gets a rigorous mathematical result for 2-layer networks.
A new paper on arXiv (2606.04429) tackles the long-standing tension between the flatness heuristic for generalization and the critique by Dinh et al. (2017), who showed that network symmetries can arbitrarily alter flatness measures such as the Hessian trace without changing the model's behavior. The authors, Harsh Vardhan, Hossein Taheri, and Arya Mazumdar, consider 2-layer homogeneous neural networks trained on multi-index models, specifically data generated by a sum of single-index models. They prove two key results. First, there exists a natural class of non-generalizing interpolators whose flatness cannot be brought arbitrarily close to that of the flattest interpolator, even using the full symmetry group of the network. Second, when approximation error and label noise are low, any interpolator whose flatness measure is orderwise minimal (a 'flattest' interpolator) achieves small population loss, meaning it generalizes.
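To make the data model concrete, here is a minimal sketch of labels generated by a sum of single-index models, y = sum_k g_k(<u_k, x>). The directions `directions`, the link functions `links`, and the helper name `multi_index_label` are hypothetical choices for illustration; the paper's exact assumptions on them are not reproduced here.

```python
import math
import random


def multi_index_label(x, directions, links):
    """Label from a sum of single-index models: y = sum_k g_k(<u_k, x>)."""
    total = 0.0
    for u, g in zip(directions, links):
        projection = sum(ui * xi for ui, xi in zip(u, x))
        total += g(projection)
    return total


# Hypothetical directions u_k and link functions g_k (not taken from the paper).
directions = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
links = [math.tanh, lambda t: t * t]

# Draw a few Gaussian inputs and label them.
rng = random.Random(0)
xs = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(5)]
ys = [multi_index_label(x, directions, links) for x in xs]
```

In this toy setup each label depends on the input only through two one-dimensional projections, so the third coordinate of `x` has no effect on `y`.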
This establishes a direct mathematical link between flatness and generalization for a large class of activations (including ReLU) and realistic data distributions, despite the symmetries that previously made the heuristic seem vacuous. The work goes beyond earlier edge-case analyses by focusing on the flattest interpolators rather than arbitrary ones. It suggests that while any individual interpolator can be made sharper or flatter via reparameterization, the set of flatness values reachable this way is constrained for non-generalizing solutions. The paper is a significant step toward a rigorous theory of why flat minima generalize, with implications for model selection and for understanding the implicit bias of gradient methods.
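The reparameterization point can be checked on a toy example. The sketch below assumes a 2-layer ReLU network f(x) = sum_j a_j * relu(<w_j, x>) and a squared loss that is fit exactly (all residuals zero), in which case the Hessian trace equals the mean squared norm of the parameter gradient of f. The weights, inputs, and function names are made up for illustration and are not the paper's construction.

```python
def relu(t):
    return t if t > 0.0 else 0.0


def predict(a, W, x):
    """Two-layer ReLU network f(x) = sum_j a_j * relu(<w_j, x>)."""
    return sum(aj * relu(sum(wi * xi for wi, xi in zip(wj, x)))
               for aj, wj in zip(a, W))


def hessian_trace_at_interpolation(a, W, xs):
    """Hessian trace of L = (1/2n) sum_i (f(x_i) - y_i)^2 at zero residuals.

    With zero residuals the Hessian is (1/n) sum_i grad f(x_i) grad f(x_i)^T,
    so its trace is the mean squared gradient norm of f w.r.t. (a, W).
    """
    total = 0.0
    for x in xs:
        x_sq = sum(xi * xi for xi in x)
        for aj, wj in zip(a, W):
            pre = sum(wi * xi for wi, xi in zip(wj, x))
            if pre > 0.0:                  # only active ReLU units contribute
                total += pre * pre         # (df/da_j)^2
                total += aj * aj * x_sq    # ||df/dw_j||^2
    return total / len(xs)


def rescale(a, W, c):
    """Layer rescaling (a, W) -> (a / c, c * W); f is unchanged for c > 0."""
    return [aj / c for aj in a], [[c * wi for wi in wj] for wj in W]


# Hypothetical weights and inputs; labels are assumed to equal predict(a, W, x).
a = [1.0, -2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 2.0], [-1.0, 1.0], [3.0, -1.0]]

a2, W2 = rescale(a, W, 2.0)
same_outputs = all(predict(a, W, x) == predict(a2, W2, x) for x in xs)
trace_before = hessian_trace_at_interpolation(a, W, xs)
trace_after = hessian_trace_at_interpolation(a2, W2, xs)
```

Here `same_outputs` is true while `trace_after` differs from `trace_before`, which is the rescaling effect behind the Dinh et al. critique. In this toy case the trace along the orbit has the form c^2 * A + B / c^2 with A, B > 0, so rescaling can make it arbitrarily large but cannot push it below 2 * sqrt(A * B).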
- Proves that for 2-layer homogeneous networks learning sums of single-index models, flattest interpolators (minimum Hessian trace) generalize under low noise and approximation error.
- Identifies a class of non-generalizing interpolators whose flatness cannot be brought arbitrarily close to the flattest possible, even using network symmetries, countering the critique of Dinh et al. (2017).
- Establishes a rigorous connection between flatness and generalization for a broad family of activations (including ReLU) and realistic data distributions, not just worst-case examples.
Why It Matters
Brings mathematical rigor to a key deep learning heuristic, potentially informing practical model selection and training stability.