MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN, a new linear attention model with momentum-based state updates, outperforms Transformers and Mamba2 at the 1.3B-parameter scale.
Linear attention (LA) models such as Mamba2 and GDN avoid the quadratic complexity of standard self-attention, enabling scaling to extremely long sequences. However, their state updates amount to naive single-step SGD recurrences, which suffer from rapid information decay and suboptimal convergence. Momentum-based updates could fix this, but they pose a challenge: how to parallelize the stepwise recurrence efficiently during training.
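To make the contrast concrete, here is a minimal NumPy sketch of a delta-rule (single-step SGD) state update next to a momentum-augmented variant. The function names, shapes, and the plain heavy-ball form are illustrative assumptions, not MDN's exact parameterization (normalization and gating are omitted).

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One naive single-step SGD (delta-rule) update of the memory state S.

    S: (d_k, d_v) state matrix, k: (d_k,) key, v: (d_v,) value,
    beta: write strength. Illustrative form only; the exact DeltaNet/GDN
    parameterization is more elaborate.
    """
    err = v - S.T @ k                  # prediction error for this key
    return S + beta * np.outer(k, err)

def momentum_delta_step(S, M, k, v, beta, mu):
    """The same update with a heavy-ball momentum buffer M (assumed form).

    Past error signals keep contributing through M rather than being
    overwritten in a single step, which is the intuition behind adding
    stepwise momentum to the recurrence.
    """
    err = v - S.T @ k
    M = mu * M + np.outer(k, err)      # accumulate the update direction
    S = S + beta * M                   # state moves along the momentum
    return S, M
```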
Momentum DeltaNet (MDN) solves this by geometrically reordering the update coefficients, which yields a chunkwise parallel training algorithm. From a dynamical-systems perspective, the researchers analyze the momentum recurrence as a second-order system with complex conjugate eigenvalues, which guides constraints on the gating for stability. Implemented with Triton kernels, MDN matches the training throughput of competitive linear models such as Mamba2 and KDA. In experiments with 400M- and 1.3B-parameter models, MDN consistently outperforms strong baselines, including Transformers, Mamba2, and GDN, across diverse downstream benchmarks. The code is released on GitHub.
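For intuition on the dynamical-systems view, the textbook heavy-ball analysis on a per-key quadratic (an assumed, simplified setting, not necessarily MDN's exact update) shows where complex conjugate eigenvalues and the resulting stability constraints come from:

```latex
% Heavy-ball momentum on the per-key quadratic f(s) = \tfrac{\kappa}{2}(s - s^\star)^2,
% with curvature \kappa = \|k\|^2 (assumed, simplified setting):
%   m_t = \mu\, m_{t-1} - \kappa\,(s_{t-1} - s^\star), \qquad s_t = s_{t-1} + \beta\, m_t .
% Writing e_t = s_t - s^\star, the pair (e_t, m_t) evolves as a linear second-order system:
\begin{pmatrix} e_t \\ m_t \end{pmatrix}
  = \underbrace{\begin{pmatrix} 1 - \beta\kappa & \beta\mu \\ -\kappa & \mu \end{pmatrix}}_{A}
    \begin{pmatrix} e_{t-1} \\ m_{t-1} \end{pmatrix},
\qquad
\det(\lambda I - A) = \lambda^{2} - (1 + \mu - \beta\kappa)\,\lambda + \mu .
% The eigenvalues form a complex conjugate pair whenever (1 + \mu - \beta\kappa)^2 < 4\mu,
% in which case |\lambda| = \sqrt{\mu}; stability then requires \mu < 1, which is the kind
% of bound a gating scheme must enforce.
```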
- Introduces stepwise momentum into the linear attention recurrence, overcoming the rapid information decay of naive single-step SGD updates.
- Achieves training throughput comparable to Mamba2 via a Triton-based chunkwise parallel algorithm built on geometric reordering of the update coefficients (sketched after this list).
- Outperforms Transformers, Mamba2, and GDN at 400M and 1.3B parameter scales on diverse downstream benchmarks.
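To illustrate the geometric reordering behind chunkwise parallelism, here is a small NumPy sketch (my own illustration, not the released Triton kernels): unrolling the stepwise momentum recurrence m_t = mu * m_{t-1} + g_t inside a chunk replaces the sequential loop with one matmul against a lower-triangular matrix of powers of mu, with a single carry crossing each chunk boundary. The real algorithm operates on matrix-valued states with data-dependent gating; the scalar decay here is an assumption for clarity.

```python
import numpy as np

def momentum_sequential(g, mu, m0):
    """Stepwise momentum recurrence m_t = mu * m_{t-1} + g_t."""
    m, out = m0, []
    for t in range(g.shape[0]):
        m = mu * m + g[t]
        out.append(m)
    return np.stack(out)

def momentum_chunkwise(g, mu, m0, chunk=4):
    """Chunkwise evaluation: inside each chunk the recurrence becomes one matmul
    with a lower-triangular matrix of geometric coefficients mu^(t-i); only one
    carried state crosses each chunk boundary."""
    T, d = g.shape
    out = np.empty_like(g)
    m = m0
    for s in range(0, T, chunk):
        gc = g[s:s + chunk]                       # (c, d) chunk of inputs
        c = gc.shape[0]
        t = np.arange(c)
        A = np.tril(mu ** (t[:, None] - t[None, :]))   # geometric reordering
        carry = mu ** (t + 1)                     # decay of the incoming state
        out[s:s + c] = A @ gc + carry[:, None] * m
        m = out[s + c - 1]                        # carry into the next chunk
    return out

# The two evaluations agree to numerical precision.
rng = np.random.default_rng(0)
g = rng.standard_normal((16, 8))
m0 = rng.standard_normal(8)
assert np.allclose(momentum_sequential(g, 0.9, m0),
                   momentum_chunkwise(g, 0.9, m0, chunk=4))
```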
Why It Matters
Momentum-augmented linear attention that scales to long sequences could enable more efficient LLMs without sacrificing quality.