Research & Papers

Scaling Attention via Feature Sparsity

New method cuts attention FLOPs and KV-cache nearly in half and speeds up pretraining by up to 2.5x, opening a path to ultra-long contexts.

Deep Dive

A team of researchers, including Yan Xie, Tiansheng Wen, and Stefanie Jegelka, has published a paper titled 'Scaling Attention via Feature Sparsity,' accepted at ICLR 2026. The work tackles the fundamental bottleneck in scaling Transformer models to ultra-long contexts: the O(n²d) cost of self-attention, which grows quadratically with sequence length n. Instead of pursuing sequence-level sparsity like most existing efficient-attention methods, the team proposes an orthogonal approach called Sparse Feature Attention (SFA), which introduces sparsity at the feature level. SFA represents queries and keys as k-sparse codes (only k of the d feature dimensions are nonzero), preserving high-dimensional expressivity while reducing the attention cost to O(n²k²/d).
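
To make the core idea concrete, here is a minimal NumPy sketch (not the authors' code): queries and keys are sparsified to their k largest-magnitude features, and each attention logit is a dot product that only involves indices where both codes are nonzero. The top-k magnitude sparsification used here is an illustrative stand-in; the paper's actual sparse coding scheme may differ.

    import numpy as np

    def topk_sparsify(x, k):
        # Keep the k largest-magnitude features of each row, zero the rest.
        # (Illustrative stand-in; the paper's sparse coding scheme may differ.)
        idx = np.argpartition(np.abs(x), -k, axis=-1)[..., -k:]
        out = np.zeros_like(x)
        np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=-1), axis=-1)
        return out

    def sfa_scores(Q, K, k):
        # Each q.k dot product only touches positions where both k-sparse codes
        # are nonzero -- on average about k^2/d of them -- which is the source
        # of the O(n^2 k^2 / d) cost. The dense matmul below is for clarity
        # only; FlashSFA exploits the sparse overlaps directly in the kernel.
        Qs, Ks = topk_sparsify(Q, k), topk_sparsify(K, k)
        return Qs @ Ks.T

    n, d, k = 8, 64, 8
    rng = np.random.default_rng(0)
    Q, K = rng.standard_normal((n, d)), rng.standard_normal((n, d))
    S = sfa_scores(Q, K, k) / np.sqrt(d)
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)  # row-wise softmax over keys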

To make this practical, the researchers developed FlashSFA, an IO-aware kernel that extends the popular FlashAttention framework to operate directly on sparse overlaps without constructing dense intermediate matrices. In extensive experiments, SFA matched the accuracy of dense-attention baselines when pretraining models such as GPT-2 and Qwen3, while delivering up to a 2.5x speed improvement and cutting both FLOPs and the memory-heavy KV-cache by nearly 50%. The method also proved robust on synthetic and downstream benchmarks, maintaining retrieval accuracy in long contexts where other efficient approximations often fail.
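
As a rough sanity check on those figures, the sketch below works through the per-score arithmetic with made-up sizes (d = 64, k = 16; not the paper's settings). Note that the score computation itself shrinks by far more than 50%; the end-to-end ~50% FLOPs figure presumably also counts the parts of the model that stay dense. The KV-cache estimate assumes sparse codes are stored as k values plus k indices, which is our assumption rather than a detail from the paper.

    d, k = 64, 16                 # illustrative sizes, not taken from the paper

    dense_per_score = d           # ~d multiply-adds per dense q.k dot product
    sparse_per_score = k * k / d  # two random k-sparse supports overlap in ~k^2/d features
    print(sparse_per_score / dense_per_score)  # 0.0625 -> far fewer score FLOPs

    kv_dense = d                  # d values per cached key/value vector
    kv_sparse = 2 * k             # k values + k indices (assumed storage format)
    print(kv_sparse / kv_dense)   # 0.5 -> consistent with the ~50% KV-cache reduction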

This research establishes feature-level sparsity as a powerful and previously underexplored axis for optimization. By providing a path to efficient long-context modeling without significant quality degradation, SFA and FlashSFA could enable the next generation of large language models to process documents, codebases, and conversations that are orders of magnitude longer than what is feasible today.

Key Points
  • Proposes Sparse Feature Attention (SFA), reducing cost from O(n²d) to O(n²k²/d) via feature sparsity, an orthogonal approach to sequence sparsity.
  • Includes FlashSFA, an efficient kernel extending FlashAttention to work directly on sparse codes, avoiding dense matrix materialization.
  • Achieves up to 2.5x faster pretraining and a ~50% reduction in FLOPs/KV-cache for GPT-2 and Qwen3 while matching dense-baseline accuracy.

Why It Matters

Enables more efficient training of LLMs for ultra-long contexts, potentially reducing compute costs and powering new applications in long-document analysis.