KAUST's ART boosts LLM throughput 20% with smart KV cache pruning
A runtime mechanism that cuts unnecessary memory fetches during long-context decoding...
A lightweight runtime mechanism called ART (Attention Run-time Termination) tracks the accumulated attention output during LLM decoding and stops fetching KV blocks once their contributions become negligible. Unlike prior pruning methods that rely on keys alone, ART makes its decision at runtime from the attention output itself, and it is orthogonal to existing KV cache management methods, so it can be integrated alongside them. On LongBench benchmarks, it achieves 20% higher generation throughput at large batch sizes while maintaining comparable accuracy.
- Tracks the accumulated attention output at runtime and stops fetching KV blocks once their contributions become negligible
- Achieves 20% higher generation throughput at large batch sizes on LongBench benchmarks
- Is orthogonal to existing key-based KV cache pruning, so it can be integrated alongside those methods without additional overhead
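
The early-termination idea can be illustrated with a minimal sketch. This is not ART's actual algorithm or API: the function name `attend_with_early_stop`, the threshold `eps`, and the stopping rule (stop when a block's share of the accumulated softmax mass drops below `eps`) are all assumptions made for illustration, and the sketch assumes blocks are visited in an order where later blocks tend to matter less. It processes one query against KV blocks using a running (online) softmax and stops reading further blocks once a block adds almost nothing.

```python
import math

def attend_with_early_stop(q, key_blocks, value_blocks, eps=1e-3):
    """Single-query attention over KV blocks with a hypothetical early stop.

    Returns (output_vector, number_of_blocks_fetched).
    The stopping rule is an illustrative assumption, not ART's criterion.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf                  # largest score seen so far
    denom = 0.0                              # sum of exp(score - running_max)
    acc = [0.0] * len(value_blocks[0][0])    # unnormalized weighted value sum
    fetched = 0

    for keys, values in zip(key_blocks, value_blocks):
        fetched += 1                         # stands in for one KV block fetch
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        new_max = max(running_max, max(scores))

        # Rescale what was accumulated so far to the new reference maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]

        block_mass = 0.0
        for s, v in zip(scores, values):
            w = math.exp(s - new_max)
            block_mass += w
            acc = [a + w * x for a, x in zip(acc, v)]
        denom += block_mass
        running_max = new_max

        # Assumed rule: stop once this block contributed a negligible share
        # of the total softmax mass accumulated so far.
        if block_mass / denom < eps:
            break

    return [a / denom for a in acc], fetched


if __name__ == "__main__":
    q = [1.0, 0.0]
    key_blocks = [
        [[10.0, 0.0], [9.0, 0.0]],      # strongly matching keys
        [[-10.0, 0.0], [-10.0, 0.0]],   # weakly matching keys
        [[-10.0, 0.0], [-10.0, 0.0]],
    ]
    value_blocks = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[5.0, 5.0], [5.0, 5.0]],
        [[5.0, 5.0], [5.0, 5.0]],
    ]
    out, fetched = attend_with_early_stop(q, key_blocks, value_blocks)
    print(out, fetched)
```

Setting `eps=0.0` disables the early stop, which gives a full-attention baseline for comparing outputs. In a real decoder the saving would come from skipping memory reads of the remaining KV blocks, which this toy loop only models by counting fetched blocks.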
Why It Matters
Enables faster, more memory-efficient long-context LLM inference, which is critical for serving very long inputs in real-time applications.