Transformers Can Learn Posterior Distributions In-Context, New Proof Shows

arXiv stat.ML May 27, 2026

⚡ A new paper proves that transformers can produce full predictive distributions in-context, not just point estimates.

Deep Dive

A new theoretical paper by Gyeonghun Kang, Changwoo J. Lee, and Xiang Cheng proves that transformer architectures can learn full posterior predictive distributions (PPDs) entirely in-context, without any weight updates. The work analyzes Prior-data Fitted Networks (PFNs), which have shown strong empirical performance in Bayesian prediction tasks. Focusing on Gaussian process regression, the authors show by construction that a transformer can internally run a gradient descent algorithm to compute both the posterior predictive mean and variance, then apply nonlinear mappings that turn those two quantities into binned probabilities of the PPD. They derive explicit error bounds on the approximated probabilities that depend on attention depth and bin resolution.
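To make the pipeline described above concrete, here is a minimal NumPy sketch of the same computation done outside a transformer: gradient descent recovers the Gaussian process posterior predictive mean and variance, and a Gaussian CDF turns them into binned probabilities. This is not the paper's construction. The RBF kernel, the noise variance, the step-size rule, the number of steps, and all function names are assumptions chosen for illustration.

```python
import math
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs (assumed kernel)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def solve_by_gradient_descent(A, b, steps=2000):
    """Minimize 0.5 * v^T A v - b^T v by gradient descent; the minimizer is A^{-1} b."""
    step_size = 1.0 / np.linalg.eigvalsh(A)[-1]  # 1 / largest eigenvalue of A
    v = np.zeros_like(b)
    for _ in range(steps):
        v = v - step_size * (A @ v - b)  # gradient of the quadratic is A v - b
    return v

def gp_posterior_by_gd(x_train, y_train, x_query, noise_var=0.1, steps=2000):
    """Posterior predictive mean and variance at one query point, via two GD solves."""
    A = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_star = rbf_kernel(x_train, np.array([x_query]))[:, 0]
    alpha = solve_by_gradient_descent(A, y_train, steps)  # approximates A^{-1} y
    beta = solve_by_gradient_descent(A, k_star, steps)    # approximates A^{-1} k_star
    mean = k_star @ alpha
    # k(x*, x*) = 1 for this kernel; noise_var is added because we predict a noisy y*.
    var = 1.0 + noise_var - k_star @ beta
    return mean, var

def binned_probabilities(mean, var, edges):
    """Probability mass of N(mean, var) in each bin; the outer bins absorb the tails."""
    std = math.sqrt(var)
    cdf = np.array([0.5 * (1.0 + math.erf((e - mean) / (std * math.sqrt(2.0))))
                    for e in edges])
    cdf[0], cdf[-1] = 0.0, 1.0
    return np.diff(cdf)

# Usage: eight context points from a sine curve, one query point, twelve bins.
x_train = np.linspace(-2.0, 2.0, 8)
y_train = np.sin(x_train)
mean, var = gp_posterior_by_gd(x_train, y_train, x_query=0.5)
probs = binned_probabilities(mean, var, edges=np.linspace(-3.0, 3.0, 13))
print(mean, var)
print(probs)
```

In this sketch the number of gradient steps plays a role loosely analogous to attention depth, and the spacing of `edges` plays the role of bin resolution. Those are the two quantities the paper's error bounds are stated in, though the exact correspondence is defined by the paper, not by this code.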

The paper also sheds light on why transformers can extrapolate beyond their pretraining sample sizes: the authors show that normalization mechanisms and appropriate attention depth are critical for this generalization. The theoretical framework shows how PFNs approximate a full Bayesian posterior predictive distribution rather than just point predictions, offering insight into the expressivity of in-context learning. The work narrows the gap between the empirical success of in-context Bayesian inference and its theoretical understanding, and it suggests that transformers can perform probabilistic reasoning: not just picking the most likely outcome, but generating a distribution over outcomes. The findings may have implications for uncertainty estimation in large language models and other transformer-based systems.

Key Points

Transformers can implement gradient descent in-context to compute posterior predictive mean and variance for Gaussian process regression.
The paper proves error bounds for approximated posterior probabilities that depend on attention depth and bin resolution.
Normalization and attention depth are shown to be critical for enabling extrapolation beyond the pretraining sample size range.

Why It Matters

The result gives a theoretical basis for why in-context learning can produce probabilistic predictions, which is crucial for uncertainty-aware AI that does not require retraining.
