LKV: End-to-end learning slashes LLM KV cache to 15% with near-lossless accuracy
New learned compression keeps only 15% of the KV cache while losing almost no performance.
Long-context inference in large language models is bottlenecked by key-value (KV) cache memory, which grows linearly with sequence length, making compression a critical research area. Existing methods rely on heuristics, either static budgeting (e.g., uniform per-head allocation) or handcrafted selection rules like attention sinks, which often misallocate resources or miss task-specific patterns. Enter LKV, a new framework from researchers at the Chinese Academy of Sciences and associated institutes, which reformulates KV cache eviction as an end-to-end differentiable optimization problem.
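To make the memory pressure concrete, here is a back-of-the-envelope sketch in Python. The model shape (32 layers, 8 KV heads, head dimension 128, fp16 storage) and the 128K-token context are illustrative assumptions, not figures from the LKV work, and the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Estimate KV cache size in bytes for a standard transformer decoder.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **cfg)           # bytes added per generated token
full_cache = kv_cache_bytes(128 * 1024, **cfg) # cache at a 128K-token context
retention = 0.15                               # fraction of entries kept
kept_cache = full_cache * retention
compression = 1 / retention                    # ratio of full to kept size

GIB = 1024 ** 3
print(f"per token:  {per_token / 1024:.0f} KiB")
print(f"full cache: {full_cache / GIB:.1f} GiB")
print(f"kept cache: {kept_cache / GIB:.1f} GiB")
print(f"compression ratio: {compression:.2f}x")
```

Under these assumed settings the cache grows in direct proportion to sequence length, and keeping 15% of entries corresponds to a 1/0.15 ≈ 6.7× reduction in cache size.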
LKV has two components: LKV-H learns a global, task-optimized budget for each attention head, while LKV-T learns intrinsic token importance scores without materializing full attention matrices. This replaces heuristic proxies with an objective aligned directly with task loss. Evaluated on LongBench and RULER, LKV achieves state-of-the-art results at high compression rates. Most strikingly, on LongBench it retains only 15% of the KV cache yet delivers near-lossless performance, a roughly 6.7× reduction in cache memory with negligible accuracy loss. The analysis further shows that learned budgeting is the dominant driver of fidelity, indicating that data-driven allocation outperforms hand-crafted heuristics across diverse long-context tasks.
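The split between per-head budgets and per-token scores can be illustrated with a small Python sketch. This is not the LKV implementation: it only shows how already-learned quantities might be applied at inference time, assuming each head has a scalar budget logit and each cached token has a scalar importance score. The function names, the softmax allocation, and the largest-remainder rounding are all assumptions made for illustration, and the differentiable training procedure is not shown.

```python
import math


def allocate_budgets(head_logits, total_keep):
    """Split `total_keep` cache slots across heads in proportion to
    softmax(head_logits). Hypothetical scheme, not the published LKV-H rule."""
    m = max(head_logits)
    exps = [math.exp(x - m) for x in head_logits]  # numerically stable softmax
    z = sum(exps)
    shares = [e / z * total_keep for e in exps]
    budgets = [int(s) for s in shares]             # floor of each share
    # Give the leftover slots to the heads with the largest fractional parts.
    leftover = total_keep - sum(budgets)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - budgets[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def select_tokens(token_scores, budget):
    """Return the positions of the `budget` highest-scoring tokens,
    in their original order. Scores stand in for learned importance."""
    budget = min(budget, len(token_scores))
    top = sorted(range(len(token_scores)),
                 key=lambda i: token_scores[i], reverse=True)[:budget]
    return sorted(top)


# Toy example: 4 heads, 20 cached tokens each, keep 15% of the 80 entries.
head_logits = [1.0, 0.0, 0.0, -1.0]           # made-up "learned" head logits
total_keep = round(0.15 * 4 * 20)             # 12 slots overall
budgets = allocate_budgets(head_logits, total_keep)

# Made-up per-token scores; a different pattern per head.
scores = [[((7 * t + 3 * h) % 20) / 20 for t in range(20)] for h in range(4)]
kept = [select_tokens(scores[h], budgets[h]) for h in range(4)]

for h in range(4):
    print(f"head {h}: budget={budgets[h]}, kept positions={kept[h]}")
```

Because the budgets are non-uniform, heads with higher logits keep more tokens than a uniform per-head split would allow, while the overall retention stays fixed at the same total.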
- LKV achieves near-lossless performance on LongBench with only 15% KV cache retention (6.7× compression).
- Uses end-to-end differentiable learning for both head-wise budgets (LKV-H) and token importance (LKV-T), replacing hand-crafted heuristics.
- Outperforms prior heuristic-based methods on both LongBench and RULER benchmarks at high compression ratios.
Why It Matters
Sharply cuts KV cache memory for long-context LLMs, enabling cheaper inference without special hardware or model pruning.