Research & Papers

KAUST's ART boosts LLM throughput 20% with smart KV cache pruning

A runtime mechanism that cuts unnecessary memory fetches during long-context decoding...

Deep Dive

ART (Attention Run-time Termination) is a lightweight runtime mechanism that tracks accumulated attention outputs during LLM decoding and stops fetching KV blocks once their contributions become negligible. Unlike prior key-only pruning methods, ART makes this decision at runtime and is orthogonal to existing KV cache management methods, so it can be combined with them. On LongBench benchmarks, it achieves 20% higher generation throughput at large batch sizes while maintaining comparable accuracy.
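To make the idea concrete, here is a minimal pure-Python sketch of blockwise attention for a single query that stops reading KV blocks early. It is an illustration under stated assumptions, not ART's actual algorithm: the function name `attend_with_early_stop`, the newest-block-first order, the `tol` threshold, and the stopping rule (the latest block's share of the accumulated softmax mass falls below `tol`) are all hypothetical choices, since the summary does not spell out the paper's criterion.

```python
import math

def attend_with_early_stop(q, keys, values, block_size=4, tol=1e-3):
    """Single-query attention over KV blocks, newest block first.

    Returns (output_vector, blocks_read). Stops reading further blocks once
    the latest block's share of the accumulated softmax mass drops below
    `tol` (a hypothetical criterion chosen for this sketch).
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf           # largest score seen so far
    denom = 0.0                       # accumulated softmax denominator
    numer = [0.0] * len(values[0])    # accumulated weighted sum of values
    blocks_read = 0

    for end in range(len(keys), 0, -block_size):
        start = max(0, end - block_size)
        scores = [scale * sum(a * b for a, b in zip(q, k))
                  for k in keys[start:end]]

        # Online-softmax update: rescale earlier accumulators to the new max.
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max) if blocks_read else 0.0
        weights = [math.exp(s - new_max) for s in scores]
        block_mass = sum(weights)

        denom = denom * rescale + block_mass
        numer = [n * rescale for n in numer]
        for w, v in zip(weights, values[start:end]):
            numer = [n + w * x for n, x in zip(numer, v)]
        running_max = new_max
        blocks_read += 1

        # Assumed stopping rule: this block's share of the softmax mass
        # is negligible, so older blocks are not fetched at all.
        if block_mass / denom < tol:
            break

    return [n / denom for n in numer], blocks_read


# Toy data: only the 4 newest tokens have keys aligned with the query.
q = [1.0, 0.0]
keys = [[-10.0, 0.0]] * 12 + [[10.0, 0.0]] * 4
values = [[float(i)] for i in range(16)]

approx, read = attend_with_early_stop(q, keys, values)
exact, total = attend_with_early_stop(q, keys, values, tol=0.0)  # never stops early
print(read, total, abs(approx[0] - exact[0]))
```

On this toy input the early-stopping call reads 2 of the 4 blocks and its output stays within 1e-4 of the full computation. The rule used here silently assumes that older blocks contribute no more than the block just read, which does not hold in general; a real system would need a criterion that accounts for that.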

Key Points
  • ART tracks accumulated attention outputs at runtime and stops fetching KV blocks when contributions become negligible
  • Achieves 20% higher generation throughput at large batch sizes on LongBench benchmarks
  • Orthogonal to existing key-based KV cache pruning, enabling seamless integration without additional overhead

Why It Matters

ART enables faster, more memory-efficient long-context LLM inference, which is critical for serving very long inputs in real-time applications.