Transformers Can Learn Posterior Distributions In-Context, New Proof Shows
New paper proves transformers can generate full probability distributions, not just point estimates.
A new theoretical paper from researchers Gyeonghun Kang, Changwoo J. Lee, and Xiang Cheng provides a rigorous proof that transformer architectures can learn full posterior predictive distributions (PPDs) entirely in-context, without any weight updates. The work specifically analyzes Prior-data Fitted Networks (PFNs), which have shown strong empirical performance in Bayesian prediction tasks. The authors focus on Gaussian process regression and demonstrate by construction that a transformer can internally implement a gradient descent algorithm to compute both the posterior predictive mean and variance, followed by nonlinear mappings that produce binned probabilities of the PPD. They derive explicit error bounds that depend on attention depth and bin resolution.
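The pipeline described above (mean and variance computed by gradient descent, then mapped to binned probabilities) can be illustrated outside a transformer. The following is a minimal NumPy sketch, not the paper's construction: it assumes an RBF kernel with unit prior variance, a known noise variance, and plain gradient descent on a quadratic objective, with the iteration count standing in loosely for attention depth. All function and parameter names are hypothetical.

```python
import math
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)


def solve_by_gd(A, b, n_steps):
    """Approximate A^{-1} b for a symmetric positive definite matrix A.

    Runs gradient descent on f(v) = 0.5 v^T A v - b^T v, whose minimizer
    solves A v = b. More steps give a smaller approximation error.
    """
    step = 1.0 / np.linalg.eigvalsh(A).max()  # 1 / largest eigenvalue keeps GD stable
    v = np.zeros_like(b, dtype=float)
    for _ in range(n_steps):
        v = v - step * (A @ v - b)  # gradient of f at v is A v - b
    return v


def gp_binned_ppd(x_train, y_train, x_test, bin_edges, noise_var=0.1, n_steps=2000):
    """Return (mean, variance, bin probabilities) of the GP posterior predictive."""
    A = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_star = rbf_kernel(x_train, np.array([x_test]))[:, 0]
    mean = k_star @ solve_by_gd(A, y_train, n_steps)
    # k(x*, x*) = 1 for this kernel; noise_var is included because the
    # predictive distribution here is over a noisy observation.
    var = 1.0 + noise_var - k_star @ solve_by_gd(A, k_star, n_steps)
    # Nonlinear map: Gaussian CDF at each bin edge, then differences per bin.
    z = (np.asarray(bin_edges) - mean) / math.sqrt(2.0 * var)
    cdf = np.array([0.5 * (1.0 + math.erf(t)) for t in z])
    return mean, var, np.diff(cdf)


# Toy data (assumed for illustration only).
x_train = np.linspace(-2.0, 2.0, 8)
y_train = np.sin(x_train)
edges = np.linspace(-5.0, 5.0, 101)
mean, var, probs = gp_binned_ppd(x_train, y_train, 0.5, edges)
print(mean, var, probs.sum())
```

Two sources of error are visible in this sketch, mirroring the quantities the bounds are said to depend on: a smaller `n_steps` leaves the mean and variance further from their exact values, and coarser `edges` give a cruder histogram of the same Gaussian.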
The paper also sheds light on why transformers can extrapolate beyond their pretraining sample sizes. The authors show that normalization mechanisms and appropriate attention depth are critical for this generalization. The theoretical framework reveals how PFNs approximate a full Bayesian posterior rather than just point predictions, offering insights into the expressivity of in-context learning. This work bridges a gap between the empirical success of in-context Bayesian inference and its theoretical understanding, suggesting that transformers are inherently capable of probabilistic reasoning: they can generate a distribution over outcomes rather than only pick the most likely token. The findings have implications for uncertainty estimation in large language models and other transformer-based systems.
- Transformers can implement gradient descent in-context to compute posterior predictive mean and variance for Gaussian process regression.
- The paper proves error bounds for approximated posterior probabilities that depend on attention depth and bin resolution.
- Normalization and attention depth are shown to be critical for enabling extrapolation beyond the pretraining sample size range.
Why It Matters
The result gives a theoretical basis for why in-context learning can produce probabilistic predictions, a capability that is crucial for uncertainty-aware AI that works without retraining.