Research & Papers

Dowling et al.'s Bayesian Layer unifies Mamba-2, DeltaNet, and linear attention

arXiv stat.ML June 01, 2026

⚡ One probabilistic framework explains and improves multiple sub-quadratic sequence models

Deep Dive

A new paper from Matthew Dowling, Hyungju Jeon, Cristina Savin, and Il Memming Park introduces the design-model framework, a principled way to derive efficient recurrent sequence maps from explicit memory assumptions. Their Bayesian Layer instantiates this framework with linear-Gaussian dynamics: it maintains both a mean (the current hidden state) and a covariance that tracks uncertainty over stored associations. The covariance steers writes toward uncertain directions, attenuates gains as evidence accumulates, and preserves confident memories, which in effect gives the model a built-in sense of what it knows and what it does not.
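To make the mean-and-covariance idea concrete, here is a minimal NumPy sketch of a Kalman-style write into an associative memory. It is an illustration under stated assumptions, not the paper's implementation: the function name `bayesian_write`, the observation model `v = S k + noise`, the single covariance `P` shared across output rows, and the `obs_noise` value are all hypothetical choices made for this example.

```python
import numpy as np

def bayesian_write(S, P, k, v, obs_noise=1e-2):
    """One Kalman-style write of the association k -> v (illustrative only).

    S: (d_v, d_k) posterior mean of the memory matrix.
    P: (d_k, d_k) covariance over key directions, shared across
       output rows (an assumption of this sketch).
    """
    Pk = P @ k
    gain = Pk / (k @ Pk + obs_noise)        # larger along uncertain directions
    S_new = S + np.outer(v - S @ k, gain)   # correct the prediction error
    P_new = P - np.outer(gain, Pk)          # uncertainty shrinks along k
    return S_new, P_new, gain

d_k, d_v = 4, 3
rng = np.random.default_rng(0)
S, P = np.zeros((d_v, d_k)), np.eye(d_k)   # empty memory, unit prior covariance
k = np.array([1.0, 0.0, 0.0, 0.0])

# Write to the same key twice: the second gain is smaller because the
# covariance along k has already collapsed after the first write.
S, P, gain_first = bayesian_write(S, P, k, rng.normal(size=d_v))
S, P, gain_second = bayesian_write(S, P, k, rng.normal(size=d_v))
```

In this toy setting the second write to the same key is attenuated, while directions that have never been written (for example `P[1, 1]`) keep their prior uncertainty and would still receive a near-full-strength write.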

The framework unifies several popular sub-quadratic recurrent architectures under a single probabilistic lens. Linear attention, GLA, and Mamba-2/SSD appear as exact Bayesian filters under one design model, while DeltaNet and related delta-rule models emerge as covariance-reset reductions under another. Restoring the full covariance in those reduced models yields closed-form predictions for retrieval dynamics, which the authors verify empirically, and improves robustness beyond the training regime. In controlled collision studies, learned associative recall, and the Zoology MQAR benchmark, the Bayesian Layer consistently outperforms baselines. Notably, distilling Bayesian Layers into a pretrained 340M-parameter Gated DeltaNet improves RULER long-context retrieval at matched compute, demonstrating practical benefit without extra inference cost.
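One way to read the "covariance-reset reduction" is through the textbook link between recursive least squares and the normalized delta rule. The sketch below assumes that reading, which may differ from the paper's exact construction: if the covariance of a Kalman-style memory is reset to the identity before every write, the update collapses to a delta-rule step with step size `beta = 1 / (k·k + obs_noise)`. All function names and the `obs_noise` parameter are hypothetical.

```python
import numpy as np

def kalman_gain(P, k, obs_noise):
    """Gain vector for observing v = S k + noise under covariance P."""
    Pk = P @ k
    return Pk / (k @ Pk + obs_noise)

def covariance_reset_step(S, k, v, obs_noise=1.0):
    """Kalman-style write with P reset to the identity (assumed reduction)."""
    gain = kalman_gain(np.eye(k.shape[0]), k, obs_noise)
    return S + np.outer(v - S @ k, gain)

def delta_rule_step(S, k, v, beta):
    """Standard delta-rule write: move S k toward v with step size beta."""
    return S + beta * np.outer(v - S @ k, k)

rng = np.random.default_rng(0)
d_k, d_v, r = 4, 3, 1.0
S = rng.normal(size=(d_v, d_k))
k, v = rng.normal(size=d_k), rng.normal(size=d_v)

S_reset = covariance_reset_step(S, k, v, obs_noise=r)
S_delta = delta_rule_step(S, k, v, beta=1.0 / (k @ k + r))

# With an identity covariance the gain is k / (k.k + r), so the two
# updates coincide up to floating-point error.
same_update = np.allclose(S_reset, S_delta)
```

Under this reading, what the reset discards is the memory of which key directions have already been written confidently. That is the information a full covariance would carry forward from one write to the next.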

Key Points

Bayesian Layer propagates both mean and covariance, tracking uncertainty over stored associations
Unifies linear attention, GLA, Mamba-2/SSD as exact filters; DeltaNet as covariance-reset reduction
Distilling into a 340M Gated DeltaNet improves RULER long-context retrieval at matched compute

Why It Matters

A unified probabilistic theory for efficient sequence models could guide better long-context architectures
