Research & Papers

Are Flat Minima an Illusion?

A new preprint argues loss landscape geometry is a confounder; a measure called 'weakness' is reparameterization-invariant and predicts generalization.

Deep Dive

A new preprint by Michael Timothy Bennett (arXiv:2605.05209) argues that the widely accepted link between flat minima and neural network generalization is an illusion. Flat minima, regions of the loss landscape where small perturbations to the weights barely change the loss, have long been thought to indicate better generalization; Sharpness-Aware Minimization (SAM) even builds the heuristic into training. But Bennett shows that a function-preserving reparameterization can inflate Hessian-based sharpness measures by two orders of magnitude without changing a single prediction. If sharpness can be manufactured without altering the function, it cannot be a causal factor in generalization. In its place he proposes 'weakness', defined as the volume of completions compatible with the learned function in the learner's embodied language. Weakness is reparameterization-invariant because it depends on what the network does, not on how it is parameterized.
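The trick is easy to demonstrate. Below is a minimal sketch (an assumed construction in the spirit of the classic α-scale transformations of Dinh et al., not code from the paper): ReLU is positively homogeneous, so scaling one layer's weights up by α and the next layer's down by α leaves every prediction unchanged while blowing up loss curvature along the rescaled weight directions.

```python
# Minimal sketch: a function-preserving rescaling that inflates sharpness.
# Illustrative construction only; the paper's own may differ.
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer ReLU network: f(x) = relu(x @ W1.T) @ W2.T
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(1, 16))

def forward(W1, W2, X):
    return np.maximum(X @ W1.T, 0.0) @ W2.T

def mse(W1, W2, X, y):
    return float(np.mean((forward(W1, W2, X) - y) ** 2))

X = rng.normal(size=(64, 8))
y = forward(W1, W2, X)  # targets = current outputs, so we sit at a zero-loss minimum

def sharpness(W1, W2, X, y, eps=1e-3, trials=200):
    """Crude sharpness proxy: mean loss increase under random weight
    perturbations of fixed Euclidean radius eps (a stand-in for the Hessian)."""
    base = mse(W1, W2, X, y)
    total = 0.0
    for _ in range(trials):
        d1, d2 = rng.normal(size=W1.shape), rng.normal(size=W2.shape)
        s = eps / np.sqrt((d1 ** 2).sum() + (d2 ** 2).sum())
        total += mse(W1 + s * d1, W2 + s * d2, X, y) - base
    return total / trials

# relu(alpha * z) = alpha * relu(z) for alpha > 0, so this rescaling
# changes the weights drastically but no prediction at all.
alpha = 100.0
W1s, W2s = alpha * W1, W2 / alpha

assert np.allclose(forward(W1, W2, X), forward(W1s, W2s, X))
print("sharpness before:", sharpness(W1, W2, X, y))
print("sharpness after :", sharpness(W1s, W2s, X, y))  # orders of magnitude larger
```

Any sharpness measure built from such curvature inherits this sensitivity, which is exactly why a reparameterization-invariant quantity is needed.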

Bennett backs his theory with extensive experiments. On MNIST with 100 identically trained networks, weakness predicts generalization (ρ=+0.374, p=0.00012), while sharpness anticorrelates (ρ=−0.226) and simplicity predicts nothing (p=0.848). On Fashion-MNIST, weakness again leads (ρ=+0.384, p=8.15×10⁻⁵), though there simplicity shows some predictive power too. Crucially, he shows that the small-batch generalization advantage, often cited as evidence for flat minima, vanishes as the training set grows: from +1.6% at n=2,000 to just +0.02% at n=60,000, supporting his claim that flatness is a confounder, not a cause. The paper further proves that weakness is minimax-optimal under exchangeable demands, and that PAC-Bayes bounds work because they correlate with weakness. The implication: the ML community's focus on loss landscape geometry may be misguided, since generalization is about the function's compatibility with the learner's language, not the shape of weight space.
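The notion of weakness is easiest to grasp in a finite toy world. The sketch below is one illustrative reading of the definition (counting the completions compatible with a hypothesis), not the paper's formalism: a hypothesis that commits to less rules out fewer completions, which is the intuition behind the minimax-optimality result.

```python
# Toy reading of 'weakness': the number of total functions (completions)
# over a finite input space that are consistent with a hypothesis.
# Illustrative only; not the paper's formalism.
from itertools import product

inputs = list(product([0, 1], repeat=2))                 # 4 possible inputs
completions = list(product([0, 1], repeat=len(inputs)))  # all 16 total functions

train = {(0, 0): 0, (1, 1): 1}  # observed data that any hypothesis must fit

def weakness(constraints):
    """Count completions compatible with the hypothesis's commitments."""
    return sum(
        all(f[inputs.index(x)] == y for x, y in constraints.items())
        for f in completions
    )

hyp_weak = dict(train)                        # commits only to the observed data
hyp_strong = {**train, (0, 1): 0, (1, 0): 0}  # also commits on unseen inputs

print(weakness(hyp_weak), weakness(hyp_strong))  # 4 vs 1
# The weaker hypothesis is compatible with 4 of the possible true functions,
# the stronger with only 1: committing beyond the data shrinks the set of
# worlds in which the hypothesis remains correct.
```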

Key Points
  • Bennett argues flat minima are an illusion: function-preserving reparameterization inflates Hessian-based sharpness without changing predictions, so the geometry cannot be causal.
  • On MNIST, weakness predicts generalization (ρ=+0.374, p=0.00012) while sharpness anticorrelates (ρ=−0.226) and simplicity fails (p=0.848); see the correlation sketch after this list.
  • The small-batch generalization advantage vanishes from +1.6% at 2K samples to +0.02% at 60K samples, consistent with flatness being a confounder rather than a cause.
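For readers who want to reproduce this style of analysis: ρ here is presumably the Spearman rank correlation between a per-network score and its generalization, computed across the population of trained networks. A minimal sketch with synthetic placeholder scores, not the paper's data:

```python
# Sketch of the rank-correlation analysis behind figures like rho=+0.374.
# All scores below are synthetic placeholders, NOT the paper's measurements.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_nets = 100  # e.g., 100 identically trained networks

test_acc = rng.uniform(0.97, 0.99, size=n_nets)          # generalization per network
weakness = test_acc + rng.normal(0, 0.005, size=n_nets)  # stand-in correlated score
sharpness = rng.normal(size=n_nets)                      # stand-in unrelated score

for name, score in [("weakness", weakness), ("sharpness", sharpness)]:
    rho, p = spearmanr(score, test_acc)
    print(f"{name}: rho={rho:+.3f}, p={p:.2g}")
```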

Why It Matters

This could fundamentally shift how researchers design and regularize neural networks: away from loss landscape heuristics and toward function-space analysis.