LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models
A new sparse attention method solves KV inflation in diffusion language models, achieving up to a 4.14x attention speedup on GPUs while improving accuracy.
A research team from UC Berkeley, KAIST, and other institutions has introduced LoSA (Locality Aware Sparse Attention), a method designed to accelerate block-wise diffusion language models (DLMs). Unlike traditional autoregressive models that generate tokens sequentially, DLMs can produce multiple tokens in any order, offering parallel generation. However, they have been bottlenecked by memory-bound attention in long-context settings, where naive sparse attention fails due to KV inflation: each query in a block selects its own set of prefix positions, so the union of key-value pages that must be loaded becomes prohibitively large.
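To see why KV inflation hurts, consider a toy simulation (the numbers and page sizes below are illustrative, not from the paper): when every query in a block independently picks its own top-k prefix pages, the union of pages the kernel must fetch quickly approaches the entire prefix, erasing the benefit of per-query sparsity.

```python
import random

# Toy illustration of KV inflation. Each of the block's queries selects its
# own top-k prefix pages; the union of selected pages is what must actually
# be loaded from the KV cache.
num_prefix_pages = 512   # prefix KV cache split into pages (hypothetical)
pages_per_query = 32     # per-query sparse budget (hypothetical)
queries_in_block = 64    # tokens denoised in parallel in one block (hypothetical)

random.seed(0)
selected = [
    set(random.sample(range(num_prefix_pages), pages_per_query))
    for _ in range(queries_in_block)
]
union = set().union(*selected)

print(f"per-query budget : {pages_per_query} pages")
print(f"union to load    : {len(union)} pages "
      f"({len(union) / num_prefix_pages:.0%} of the prefix)")
```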
LoSA addresses this limitation through a key insight: between consecutive denoising steps, only a small fraction of "active" tokens exhibit significant hidden-state changes, while the majority of "stable" tokens remain nearly constant. The system reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens, as sketched below. This shrinks the number of KV indices that must be loaded, reducing memory traffic while preserving model quality.
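A minimal sketch of this reuse-or-recompute split, assuming a simple hidden-state-change threshold to decide which tokens are active; the function names, threshold, and interface here are illustrative, and the paper's actual selection criterion and caching scheme may differ.

```python
import torch

def reuse_or_recompute_step(hidden_prev, hidden_curr, cached_attn_out,
                            sparse_attention, threshold=1e-2):
    """Toy sketch of splitting a denoising step into stable vs. active tokens.

    hidden_prev, hidden_curr : [num_tokens, d] hidden states from the previous
        and current denoising step.
    cached_attn_out          : [num_tokens, d] prefix-attention outputs cached
        at the previous step.
    sparse_attention         : callable(queries, indices) -> outputs, running
        sparse attention only for the selected token indices.
    """
    # Tokens whose hidden state barely changed are treated as "stable";
    # their cached prefix-attention result is reused unchanged.
    delta = (hidden_curr - hidden_prev).norm(dim=-1)
    active = delta > threshold  # boolean mask of "active" tokens

    out = cached_attn_out.clone()
    if active.any():
        idx = active.nonzero(as_tuple=True)[0]
        # Sparse attention runs only for active tokens, so the set of
        # KV pages that must be fetched stays small.
        out[idx] = sparse_attention(hidden_curr[idx], idx)
    return out, active
```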
Across multiple block-wise DLM architectures and benchmarks, LoSA preserves near-dense accuracy while delivering substantial efficiency gains: up to 9 points of average accuracy improvement at aggressive sparsity levels with 1.54x lower attention density. On hardware such as the RTX A6000 GPU, this translates to attention speedups of up to 4.14x.
The breakthrough represents a significant step toward making diffusion-based language models more practical for real-world applications. By solving the KV inflation problem that plagued previous sparse attention approaches for DLMs, LoSA enables faster parallel text generation without sacrificing quality, potentially making these non-autoregressive models competitive with traditional sequential generators in both speed and accuracy.
- Solves KV inflation problem in diffusion language models by reusing cached attention for stable tokens
- Achieves up to 4.14x attention speedup on RTX A6000 GPUs with practical hardware implementation
- Improves average accuracy by up to 9 points at aggressive sparsity levels with 1.54x lower attention density
Why It Matters
Enables faster parallel text generation without quality loss, making diffusion language models practical for real applications.