Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
New method compresses long contexts into learnable 'gist tokens' for efficient sparse attention...
Researchers Yuzhen Mao, Michael Y. Li, and Emily B. Fox from Stanford University have introduced Gist Sparse Attention (GSA), a novel mechanism that bridges the gap between full KV-cache compression and selective attention without requiring architectural modifications to existing LLMs. The core innovation is the use of interleaved 'gist compression tokens'—learnable summary tokens that represent sets of raw tokens—which also serve as routing signals for sparse attention. This enables a coarse-to-fine process: first, the context is compressed into gist tokens, then the most relevant gists are selected, and finally, the corresponding raw chunks are restored for detailed attention. The method is trained end-to-end, eliminating the need for external retrieval modules.
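The coarse-to-fine flow can be pictured with a short, hedged sketch. This is not the authors' implementation: here each chunk's gist is approximated by mean-pooling its keys/values (the real method uses learned, interleaved gist tokens), relevance is scored with a plain dot product, and names such as `chunk_size` and `top_k` are illustrative placeholders.

```python
# Minimal sketch of the coarse-to-fine idea, NOT the paper's implementation.
# Assumptions: gists = mean-pooled chunk keys/values; routing = dot-product
# scores; chunk_size / top_k are placeholder hyperparameters.
import torch
import torch.nn.functional as F

def gist_sparse_attention(q, k, v, chunk_size=64, top_k=4):
    """q: (d,) current query; k, v: (T, d) cached keys/values for the context."""
    T, d = k.shape
    n_chunks = T // chunk_size
    k_chunks = k[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)
    v_chunks = v[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)

    # 1) "Forget": compress each chunk into one gist key/value.
    gist_k = k_chunks.mean(dim=1)            # (n_chunks, d)
    gist_v = v_chunks.mean(dim=1)

    # 2) Route: score gists against the query, pick the most relevant chunks.
    gist_scores = gist_k @ q / d ** 0.5      # (n_chunks,)
    selected = gist_scores.topk(min(top_k, n_chunks)).indices

    # 3) "Recall": unfold the selected chunks back to raw tokens; keep only
    #    the gists for every other chunk.
    mask = torch.zeros(n_chunks, dtype=torch.bool)
    mask[selected] = True
    fine_k = k_chunks[mask].reshape(-1, d)   # raw tokens of selected chunks
    fine_v = v_chunks[mask].reshape(-1, d)
    coarse_k, coarse_v = gist_k[~mask], gist_v[~mask]

    keys = torch.cat([coarse_k, fine_k], dim=0)
    vals = torch.cat([coarse_v, fine_v], dim=0)

    # 4) Standard attention over the mixed coarse/fine memory.
    attn = F.softmax(keys @ q / d ** 0.5, dim=0)
    return attn @ vals

# Example: a 1024-token context attended via ~4 raw chunks plus gists.
q = torch.randn(64)
k, v = torch.randn(1024, 64), torch.randn(1024, 64)
out = gist_sparse_attention(q, k, v)         # (64,)
```

In the actual method the gist tokens are trained end-to-end, so the same vectors that summarize a chunk also learn to act as routing signals; the mean-pooling above merely stands in for that learned compression.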
Empirically, GSA consistently outperforms compression baselines and inference-time sparse attention methods at compression ratios from 8x to 32x on LongBench and RAG benchmarks. The framework extends hierarchically via recursive gist-of-gist construction, achieving multi-resolution context access with logarithmic per-step decoding complexity. This means LLMs can 'forget' irrelevant details during compression and 'recall' them on demand, making long-context processing both efficient and accurate. The code is publicly available on GitHub.
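To make the logarithmic claim concrete, here is a hedged sketch of a gist-of-gist pyramid: each level pools groups of the level below, and a query descends the pyramid keeping only a few groups per level, so each decoding step touches O(log T) entries. Pooling and scoring are again placeholders for the learned gist tokens, and `group` and `beam` are hypothetical parameters, not values from the paper.

```python
# Hedged sketch of hierarchical "gist-of-gist" access, not the paper's code.
# Assumption: level L index i summarizes children i*group .. (i+1)*group - 1
# at level L-1; mean pooling stands in for learned gist tokens.
import torch

def build_gist_pyramid(k, group=4):
    """Return [level0, level1, ...], each level pooling `group` entries below."""
    levels = [k]
    while levels[-1].shape[0] > group:
        cur = levels[-1]
        n = cur.shape[0] // group * group   # drop a ragged tail, if any
        levels.append(cur[:n].view(-1, group, cur.shape[-1]).mean(dim=1))
    return levels

def descend(q, levels, group=4, beam=2):
    """Walk top-down, keeping only the `beam` best groups at every level."""
    top = levels[-1]
    cand = torch.topk(top @ q, min(beam, top.shape[0])).indices
    for lvl in range(len(levels) - 2, -1, -1):
        # Expand each kept gist into its children at the next finer level.
        children = torch.cat([
            torch.arange(i * group, min((i + 1) * group, levels[lvl].shape[0]))
            for i in cand.tolist()
        ])
        scores = levels[lvl][children] @ q
        cand = children[torch.topk(scores, min(beam, children.shape[0])).indices]
    return cand  # indices of the raw tokens finally attended to

k = torch.randn(1024, 64)
q = torch.randn(64)
print(descend(q, build_gist_pyramid(k)))  # a handful of raw-token indices
```

The point of the pyramid is that per-step work scales with the depth of the hierarchy rather than the raw context length, which is what the summary above calls multi-resolution access with logarithmic per-step decoding complexity.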
- Achieves 8x to 32x compression of LLM context without external retrieval modules
- Uses learnable 'gist tokens' as both compression summaries and sparse attention routing signals
- Hierarchical gist-of-gist construction enables logarithmic decoding complexity for multi-resolution access
Why It Matters
GSA makes long-context LLM inference practical by dramatically reducing memory and compute costs without sacrificing accuracy.