Learning to Evict from Key-Value Cache
This approach could cut the cost and latency of running massive AI models by up to 2x.
Researchers introduced KV Policy (KVP), a reinforcement learning framework that learns an eviction policy for the memory-intensive Key-Value (KV) cache in Large Language Models. Instead of relying on simple hand-crafted heuristics, KVP trains lightweight agents to predict which cached tokens will be most useful for future text generation and evicts the rest. The method significantly outperforms existing eviction baselines on long-context benchmarks such as RULER and generalizes well to tasks like LongBench, enabling more efficient inference without modifying the core model architecture.
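To make the idea concrete, here is a minimal sketch of score-based cache eviction. This is not KVP's actual algorithm: the `usefulness_scores` values stand in for the predictions of the learned agent, and `evict_lowest_scoring` simply keeps the top-scoring entries within a fixed budget.

```python
def evict_lowest_scoring(cache, usefulness_scores, capacity):
    """Keep only the `capacity` highest-scoring cached tokens.

    `cache` is a list of cached entries (here, placeholder token ids;
    in a real LLM, per-token key/value tensors), and `usefulness_scores`
    is a parallel list of scores. In KVP these scores would come from a
    lightweight learned policy; here they are given directly.
    """
    if len(cache) <= capacity:
        return cache, usefulness_scores
    # Indices of the top-`capacity` entries by score.
    keep = sorted(range(len(cache)),
                  key=lambda i: usefulness_scores[i],
                  reverse=True)[:capacity]
    keep.sort()  # preserve original token order in the retained cache
    return ([cache[i] for i in keep],
            [usefulness_scores[i] for i in keep])


# Toy usage: five cached tokens, budget of three.
cache = ["tok0", "tok1", "tok2", "tok3", "tok4"]
scores = [0.9, 0.1, 0.5, 0.7, 0.2]
kept, kept_scores = evict_lowest_scoring(cache, scores, capacity=3)
# kept == ["tok0", "tok2", "tok3"]
```

The benefit of a learned policy over a fixed heuristic is that the scores can reflect which tokens a particular model actually attends to later, rather than a one-size-fits-all rule.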
Why It Matters
It dramatically reduces the cost and latency of running state-of-the-art LLMs, making advanced AI more accessible.