Research & Papers

Dowling et al.'s Bayesian Layer unifies Mamba-2, DeltaNet, and linear attention

One probabilistic framework explains and improves multiple sub-quadratic sequence models

Deep Dive

A new paper from Matthew Dowling, Hyungju Jeon, Cristina Savin, and Il Memming Park introduces the design-model framework, a principled way to derive efficient recurrent sequence maps from explicit memory assumptions. Their Bayesian Layer instantiates this framework with linear-Gaussian dynamics: it maintains both a mean (the current hidden state) and a covariance that tracks uncertainty over stored associations. This covariance steers writes toward uncertain directions, attenuates gains as evidence accumulates, and preserves confident memories — effectively giving the model a built-in sense of what it knows and doesn't know.
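To make the role of the covariance concrete, here is a minimal NumPy sketch of a textbook Kalman / recursive-least-squares write into a linear associative memory. It is an illustration under stated assumptions, not the paper's implementation: the function name `bayesian_write`, the shared key-space covariance `P`, the scalar observation-noise variance `r`, and the identity prior are all choices made for this example.

```python
import numpy as np

def bayesian_write(M, P, k, v, r=0.1):
    """One Kalman/RLS-style write of the pair (k, v).

    M : (d_v, d_k) memory mean, read out as M @ k
    P : (d_k, d_k) covariance over key directions (assumed shared across value dims)
    r : assumed scalar observation-noise variance (hypothetical parameter)
    """
    Pk = P @ k
    s = k @ Pk + r                          # predictive variance along k, plus noise
    gain = Pk / s                           # write direction is shaped by P, not just k
    M_new = M + np.outer(v - M @ k, gain)   # correct the retrieval error
    P_new = P - np.outer(gain, Pk)          # uncertainty shrinks along the written key
    return M_new, P_new, gain

d_k, d_v = 4, 3
M = np.zeros((d_v, d_k))
P = np.eye(d_k)                             # assumed prior: equally uncertain everywhere
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([1.0, 2.0, 3.0])

M, P, g1 = bayesian_write(M, P, k, v)       # first write: large gain
M, P, g2 = bayesian_write(M, P, k, v)       # repeat write: smaller gain

print("gain norms:", np.linalg.norm(g1), np.linalg.norm(g2))
print("remaining variance per key direction:", np.diag(P))
```

In this sketch the second write of the same key gets a smaller gain because the covariance has already shrunk along that direction, while directions that have not been written keep their prior variance. That is the sense in which evidence attenuates later writes and untouched associations are left alone.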

The framework unifies several popular sub-quadratic recurrent architectures under a single probabilistic lens. Linear attention, GLA, and Mamba-2/SSD appear as exact Bayesian filters under one design model, while DeltaNet and related delta-rule models emerge as covariance-reset reductions under another. Restoring the full covariance in those reduced models yields closed-form predictions for retrieval dynamics — verified empirically — and improves robustness beyond the training regime. In controlled collision studies, learned associative recall, and the Zoology MQAR benchmark, the Bayesian Layer consistently outperforms baselines. Notably, distilling Bayesian Layers into a pretrained 340M-parameter Gated DeltaNet improves RULER long-context retrieval at matched compute, demonstrating practical benefit without extra inference cost.
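One way to read "covariance-reset reduction" can be sketched in a few lines, with the caveat that this is an interpretation for illustration and not the paper's derivation. Assuming the reset means replacing the covariance with the identity before every write, and assuming a scalar noise variance `r`, a standard Kalman-style write collapses to the delta rule with step size `beta = 1 / (k·k + r)`. The additive linear-attention update is included only as the familiar baseline with no error-correction term; all function names here are hypothetical.

```python
import numpy as np

def linear_attention_write(S, k, v):
    # Plain additive outer-product write: no error correction.
    return S + np.outer(v, k)

def delta_rule_write(S, k, v, beta):
    # Delta rule: write the retrieval error along the key, scaled by beta.
    return S + beta * np.outer(v - S @ k, k)

def kalman_write(S, P, k, v, r):
    # Textbook Kalman/RLS write with key-space covariance P (illustrative).
    Pk = P @ k
    gain = Pk / (k @ Pk + r)
    return S + np.outer(v - S @ k, gain), P - np.outer(gain, Pk)

rng = np.random.default_rng(0)
d_k, d_v, r = 4, 3, 0.5                     # r is an assumed noise variance

S_lin = np.zeros((d_v, d_k))
S_delta = np.zeros((d_v, d_k))
S_reset = np.zeros((d_v, d_k))

for _ in range(5):
    k = rng.normal(size=d_k)
    v = rng.normal(size=d_v)
    S_lin = linear_attention_write(S_lin, k, v)
    S_delta = delta_rule_write(S_delta, k, v, beta=1.0 / (k @ k + r))
    # Discard the updated covariance and start from the identity every step.
    S_reset, _ = kalman_write(S_reset, np.eye(d_k), k, v, r)

print("reset filter matches delta rule:", np.allclose(S_delta, S_reset))
```

Keeping the updated covariance instead of discarding it is what "restoring the full covariance" would correspond to in this toy setting: the gain then depends on everything written so far, not only on the current key.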

Key Points
  • Bayesian Layer propagates both mean and covariance, tracking uncertainty over stored associations
  • Unifies linear attention, GLA, and Mamba-2/SSD as exact Bayesian filters; DeltaNet and related delta-rule models emerge as covariance-reset reductions
  • Distilling into a 340M Gated DeltaNet improves RULER long-context retrieval at matched compute

Why It Matters

A unified probabilistic theory of efficient sequence models could guide the design of better long-context architectures.