Bayesian Optimality of In-Context Learning with Selective State Spaces
Selective State Space Models provably outperform Transformers on tasks with structured noise, offering a new principle for architecture design.
A new theoretical result from researchers Di Zhang and Jiaqi Xing demonstrates that selective State Space Models (SSMs) achieve Bayesian optimality for in-context learning, fundamentally challenging the dominant 'Transformers-as-implicit-gradient-descent' paradigm. The paper, 'Bayesian Optimality of In-Context Learning with Selective State Spaces,' proves that for tasks governed by Linear Gaussian State Space Models (LG-SSMs), a meta-trained selective SSM asymptotically implements the Bayes-optimal predictor, converging to the posterior predictive mean.
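For reference, a generic textbook formulation of the LG-SSM setting and its Bayes-optimal one-step predictor is sketched below; the notation (A, C, Q, R, K_t) is illustrative and not necessarily the paper's.

```latex
% Linear Gaussian state space model (generic textbook form; illustrative notation):
\begin{align}
  z_{t+1} &= A z_t + w_t, & w_t &\sim \mathcal{N}(0, Q), \\
  y_t     &= C z_t + v_t, & v_t &\sim \mathcal{N}(0, R).
\end{align}
% The Bayes-optimal predictor of the next observation is the posterior
% predictive mean, computable in closed form by the Kalman filter:
\begin{align}
  \hat{y}_{t+1} &= \mathbb{E}\!\left[\,y_{t+1} \mid y_{1:t}\,\right]
                 = C\,\hat{z}_{t+1 \mid t}, \\
  \hat{z}_{t+1 \mid t} &= A\,\hat{z}_{t \mid t-1}
                 + A K_t \left(y_t - C\,\hat{z}_{t \mid t-1}\right),
\end{align}
% where K_t is the Kalman gain. The paper's claim is that a meta-trained
% selective SSM converges to this posterior predictive mean as context grows.
```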
Crucially, the research establishes a statistical separation from gradient-descent methods: it constructs tasks with temporally correlated noise on which the optimal Bayesian predictor strictly outperforms any empirical risk minimization (ERM) estimator. Since Transformers can be viewed as performing implicit ERM, this implies that selective SSMs achieve lower asymptotic risk through superior statistical efficiency. Experimental validation on synthetic LG-SSM tasks and character-level Markov benchmarks confirms that selective SSMs converge faster to the Bayes-optimal risk, show superior sample efficiency with longer contexts in structured-noise settings, and track latent states more robustly than linear Transformers.
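The flavor of this separation can be reproduced in a toy experiment. The sketch below is illustrative only and is not the paper's construction: it simulates a scalar LG-SSM whose observation noise is itself AR(1)-correlated, then compares the Bayes-optimal Kalman predictor for that model against a naive ERM baseline (a least-squares AR(1) fit on the raw observations). All constants and the choice of baseline are assumptions made for this sketch.

```python
# Minimal sketch (not the paper's code): scalar LG-SSM with temporally
# correlated (AR(1)) observation noise. Compare the Bayes-optimal predictor
# -- a Kalman filter over the augmented state [latent, noise] -- with a naive
# ERM baseline that fits an AR(1) model to the observations by least squares.
import numpy as np

rng = np.random.default_rng(0)
a, rho = 0.95, 0.8            # latent dynamics / noise correlation (illustrative)
q, r = 0.1, 0.3               # process / noise-innovation variances (illustrative)
T = 5000

# Augmented state s_t = [z_t, v_t]; observation y_t = z_t + v_t.
A = np.array([[a, 0.0], [0.0, rho]])
Q = np.diag([q, r])
C = np.array([[1.0, 1.0]])

# --- simulate the process -------------------------------------------------
s = np.zeros(2)
y = np.zeros(T)
for t in range(T):
    s = A @ s + rng.multivariate_normal(np.zeros(2), Q)
    y[t] = (C @ s).item()

# --- Bayes-optimal one-step prediction via the Kalman filter ---------------
s_hat, P = np.zeros(2), np.eye(2)     # prior mean / covariance of the state
kalman_pred = np.zeros(T)
for t in range(T):
    # predict the next state and the next observation
    s_hat, P = A @ s_hat, A @ P @ A.T + Q
    kalman_pred[t] = (C @ s_hat).item()
    # update with the observed y_t
    S = (C @ P @ C.T).item() + 1e-9   # innovation variance (jitter for safety)
    K = (P @ C.T).ravel() / S         # Kalman gain
    s_hat = s_hat + K * (y[t] - kalman_pred[t])
    P = P - np.outer(K, C @ P)

# --- ERM baseline: least-squares AR(1) fit on the raw observations ---------
w = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])
erm_pred = np.concatenate([[0.0], w * y[:-1]])

print("Kalman (Bayes) MSE:", np.mean((y - kalman_pred) ** 2))
print("ERM AR(1)      MSE:", np.mean((y - erm_pred) ** 2))
```

Because the correlated noise makes the observed sequence non-Markovian in y alone, the Kalman predictor, which tracks the latent state, attains lower mean-squared error than the memoryless AR(1) fit in this toy setting.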
This work reframes in-context learning from 'implicit optimization' to 'optimal inference,' providing a rigorous mathematical foundation for understanding why architectures like Mamba have shown such promise. The findings offer a principled basis for future neural architecture design, suggesting that moving beyond the Transformer's implicit gradient descent approach toward explicit Bayesian inference mechanisms could yield more statistically efficient models for sequential data with complex dependencies.
- Proves selective SSMs achieve Bayesian optimality for Linear Gaussian State Space Model tasks, converging to the posterior predictive mean
- Demonstrates statistical separation from gradient descent: SSMs outperform ERM estimators on tasks with temporally correlated noise
- Experimental validation shows SSMs converge faster to the Bayes-optimal risk and have superior sample efficiency with long contexts
Why It Matters
Provides a mathematical foundation for next-generation architectures beyond Transformers, enabling more statistically efficient sequence modeling for real-world data with complex dependencies.