Beyond NNGP: Large Deviations and Feature Learning in Bayesian Neural Networks
A new mathematical framework captures rare but critical fluctuations that govern how AI models learn features.
A team of researchers has published a significant theoretical advance in understanding how modern Bayesian neural networks learn. Their paper, 'Beyond NNGP: Large Deviations and Feature Learning in Bayesian Neural Networks,' introduces a mathematical framework that moves beyond the standard Neural Network Gaussian Process (NNGP) theory. The NNGP limit, which treats infinitely wide networks as Gaussian processes with a fixed kernel, fails to capture the rare but statistically dominant fluctuations that govern where the network's posterior probability concentrates. The new work applies large-deviation theory to derive explicit rate functions, variational objectives defined directly on predictors, that quantify these fluctuations and yield a notion of functional complexity tied to feature learning.
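To make the baseline concrete: in the NNGP limit, a wide network's prior over outputs converges to a Gaussian process whose covariance is a fixed, architecture-determined kernel. The sketch below, a minimal illustration not taken from the paper, checks this numerically for a one-hidden-layer ReLU network by comparing a Monte Carlo estimate of the output covariance against the known arccosine (Cho-Saul) kernel; all names, weight scalings, and sample counts are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                   # input dimension (arbitrary choice)
x1 = rng.standard_normal(d)
x2 = rng.standard_normal(d)

def arccos_kernel(x, y, sv2=2.0):
    """Analytic NNGP kernel of a one-hidden-layer ReLU network (Cho-Saul).

    Assumes first-layer weights with variance 1/d and readout weights
    with variance sv2/width, so the kernel is width-independent."""
    kxx, kyy, kxy = x @ x / d, y @ y / d, x @ y / d
    theta = np.arccos(np.clip(kxy / np.sqrt(kxx * kyy), -1.0, 1.0))
    return sv2 / (2 * np.pi) * np.sqrt(kxx * kyy) * (
        np.sin(theta) + (np.pi - theta) * np.cos(theta))

def mc_kernel(x, y, width, samples=5000, sv2=2.0):
    """Monte Carlo estimate of E[f(x) f(y)] over random network draws."""
    W = rng.standard_normal((samples, width, d)) / np.sqrt(d)        # hidden layer
    v = rng.standard_normal((samples, width)) * np.sqrt(sv2 / width)  # readout
    fx = np.einsum('sw,sw->s', v, np.maximum(W @ x, 0.0))
    fy = np.einsum('sw,sw->s', v, np.maximum(W @ y, 0.0))
    return float(np.mean(fx * fy))

mc = mc_kernel(x1, x2, width=128)
exact = arccos_kernel(x1, x2)
# the two agree up to Monte Carlo error, at any width, since the
# output covariance is exactly the NNGP kernel for this architecture
```

The paper's point is that this fixed kernel is only the leading-order picture: the large-deviation framework quantifies the exponentially rare departures from it that dominate the posterior.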
The key result is that the posterior's behavior is determined by a joint optimization over both the final predictor and the internal representations (kernels), in contrast to fixed-kernel NNGP theory. This allows the framework to capture data-dependent kernel selection, a core mechanism of feature learning. The researchers validated the theory with numerical experiments, showing that it accurately describes the non-Gaussian statistical tails and posterior deformation observed in real, moderately sized finite-width networks. The result is a more precise bridge between infinite-width theory and practical network behavior, along with new tools for analyzing how neural networks generalize and adapt their internal features to data.
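A toy analogue of this joint optimization, not the paper's actual rate function, can be sketched in a few lines: treat the predictor f and a kernel hyperparameter (here an RBF lengthscale, standing in for the learned representation) as joint variables, minimize over f in closed form for each fixed kernel, then search over the kernel. The dataset, noise level, and kernel family below are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(-2.0, 2.0, 12)
y = np.sin(2.0 * X) + 0.05 * rng.standard_normal(12)  # toy 1-D regression targets
sigma2 = 0.01                                          # assumed noise variance

def rbf(X, ell):
    """RBF Gram matrix; the lengthscale ell stands in for a learnable kernel."""
    D = X[:, None] - X[None, :]
    return np.exp(-0.5 * (D / ell) ** 2)

def joint_objective(ell):
    """Data-fit term plus a quadratic complexity penalty on f, with f minimized out.

    For a fixed kernel K the inner minimum over f is the closed-form
    solution f* = K (K + sigma2 I)^{-1} y, so what remains is an outer,
    data-dependent search over the kernel itself."""
    n = len(X)
    K = rbf(X, ell)
    f_star = K @ np.linalg.solve(K + sigma2 * np.eye(n), y)
    fit = np.sum((y - f_star) ** 2) / (2.0 * sigma2)
    # tiny jitter keeps the complexity term stable when K is near-singular
    complexity = 0.5 * f_star @ np.linalg.solve(K + 1e-9 * np.eye(n), f_star)
    return fit + complexity

ells = np.linspace(0.1, 3.0, 30)
vals = [joint_objective(l) for l in ells]
best_ell = ells[int(np.argmin(vals))]  # the kernel the data itself selects
```

The point of the toy is structural rather than quantitative: the selected lengthscale depends on y, so the effective kernel is chosen by the data, which is exactly the kind of adaptation a fixed-kernel NNGP description cannot express.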
- Moves beyond the standard Neural Network Gaussian Process (NNGP) limit to model rare, dominant fluctuations.
- Uses large-deviation theory to provide explicit variational objectives (rate functions) for predictors.
- Captures finite-width network behavior like non-Gaussian tails and data-dependent kernel selection, core to feature learning.
Why It Matters
Provides a sharper theoretical tool to understand and design neural networks that learn meaningful features from data, beyond simplistic Gaussian approximations.