MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation
New attention scheme reduces KV cache accesses by up to 99% and speeds up generation by more than 2.6x end-to-end.
A research team including Jinghan Yao and Sam Adé Jacobs has published a paper on MAC-Attention, a breakthrough method designed to solve the major bottleneck in long-context LLM inference. Currently, generating each new token requires re-reading the entire, ever-growing Key-Value (KV) cache from memory, making the process IO-bound. Previous solutions like compression or selective caching sacrificed accuracy or accessibility. MAC-Attention takes a different approach: it reuses prior attention computations when the current query is semantically similar to a recent one, dramatically reducing the need for fresh data fetches.
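As a rough illustration of what such a similarity test could look like, the sketch below compares the current query (before rotary position embedding is applied) against a small window of recent queries by L2 distance. The window size, threshold value, and function names here are hypothetical assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def find_reusable_query(current_q, recent_qs, threshold=0.1):
    """Return the index of the closest recent query if it is similar enough.

    current_q : (d,) pre-RoPE query vector for the token being decoded.
    recent_qs : (w, d) pre-RoPE queries from the last w decode steps.
    threshold : hypothetical L2-distance cutoff for declaring a match.
    """
    if len(recent_qs) == 0:
        return None  # nothing to reuse yet; fall back to full attention
    dists = np.linalg.norm(recent_qs - current_q, axis=1)
    best = int(np.argmin(dists))
    return best if dists[best] < threshold else None
```

If no recent query is close enough, the decoder would simply perform full attention for that token, so reuse only kicks in when it is likely to be accurate.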
The method works in three stages: Match, Amend, and Complete. First, it finds a similar recent query using an L2 search over pre-RoPE queries. Then it 'amends' the reused attention by recomputing only a small band of keys near the match boundary to preserve accuracy. Finally, it 'completes' the result by fusing this with fresh attention computed over the new tail of the KV cache. On a successful match, the compute and memory-bandwidth cost per token is constant, no matter how long the context grows.
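To make the 'amend' and 'complete' steps concrete, here is a minimal sketch of how partial attention results over disjoint key ranges can be fused by carrying their softmax normalizers, in the spirit of online-softmax merging. All function and variable names are illustrative assumptions rather than the paper's implementation, and the reused prefix state is taken to come from the matched query's earlier computation.

```python
import numpy as np

def partial_attention(q, K, V):
    """Attention of q over one key/value slice, returned as a mergeable state:
    (unnormalized weighted sum of values, softmax normalizer, running max)."""
    scores = K @ q / np.sqrt(q.shape[-1])
    m = scores.max()
    w = np.exp(scores - m)            # numerically stabilized weights
    return w @ V, w.sum(), m

def merge(states):
    """Fuse partial attention states over disjoint key ranges into one output."""
    m = max(s[2] for s in states)
    num = sum(np.exp(s_m - m) * o for o, _, s_m in states)
    den = sum(np.exp(s_m - m) * z for _, z, s_m in states)
    return num / den

def mac_decode_step(q, K, V, reused_prefix_state, band, tail):
    """Hypothetical decode step on a successful match.

    reused_prefix_state : state reused from the matched query's earlier step,
                          covering the prefix keys (an approximation).
    band, tail          : disjoint index ranges (no overlap with the prefix)
                          for the boundary band and the newly appended tokens.
    """
    amend = partial_attention(q, K[band], V[band])     # fix accuracy near the boundary
    complete = partial_attention(q, K[tail], V[tail])  # fresh attention over new tokens
    return merge([reused_prefix_state, amend, complete])
```

Because only the fixed-width boundary band and the short new tail are recomputed, the work per matched token stays constant even as the prefix keeps growing, which is the source of the constant per-token cost described above.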
Benchmarked against the high-performance FlashInfer library, MAC-Attention delivers staggering results. It reduces accesses to the KV cache by up to 99%, slashes per-token generation latency by over 60% at 128K context length, and achieves over 14.3x speedups in the attention phase alone. Crucially, it maintains the quality of full attention, as validated on LongBench v2, RULER, and LongGenBench. The technique is model-agnostic and compatible with existing optimized kernels and memory managers, making it a practical drop-in acceleration for current systems.
- Cuts KV cache memory accesses by up to 99%, addressing the core IO bottleneck in long-context decoding.
- Achieves over 14.3x attention-phase speedups and over 2.6x end-to-end speedups while maintaining full model accuracy.
- Uses a novel three-stage (Match-Amend-Complete) scheme to reuse computations for similar queries, making cost constant per token.
Why It Matters
Enables faster, cheaper long-context AI applications like document analysis and multi-turn chat without sacrificing output quality.