Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
A new method cuts caching costs by storing recurrent states only at sparse checkpoints, not at every token.
Current LLM serving systems rely on prefix caching that assumes dense per-token key/value reuse. But state-space models (SSMs) and recurrent layers change the game: a single stored hidden state can resume generation without the full token history. Researchers Mikhail Shirokikh and Sergey Nikolenko exploit this asymmetry with a new method called sparse prefix caching. Instead of caching every token's KV pair, they store exact recurrent states at a sparse set of checkpoint positions. On a cache hit, the system resumes from the deepest stored checkpoint and recomputes the remaining suffix exactly. They formalize optimal checkpoint placement, given a distribution of prefix overlaps across requests, as an O(NM) dynamic program over N prefix positions and M checkpoints.
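To make the placement problem concrete, here is a minimal sketch of such a dynamic program under a simple cost model (the function name, cost model, and recurrence are illustrative assumptions, not the paper's code): p[t] is the probability that a new request overlaps the cached prefix by exactly t tokens, and resuming from the deepest checkpoint c <= t costs t - c recomputed tokens. The naive inner minimization below is O(N^2 M); the paper's O(NM) bound implies a tighter recurrence.

```python
import numpy as np

def optimal_checkpoints(p, M):
    """Choose up to M checkpoint positions in a prefix of length N that
    minimize the expected number of suffix tokens recomputed on a hit.

    p[t] is the (assumed known) probability that a new request overlaps
    the cached prefix by exactly t tokens, t = 0..N.  On a hit of depth t
    the system resumes from the deepest checkpoint c <= t and recomputes
    t - c tokens; position 0 (the initial state) is always available.
    """
    N = len(p) - 1
    # Prefix sums for O(1) expected-cost queries over depth ranges:
    #   cum_p[t]  = sum_{s<=t} p[s],   cum_tp[t] = sum_{s<=t} s * p[s]
    cum_p = np.cumsum(p)
    cum_tp = np.cumsum(np.arange(N + 1) * np.asarray(p))

    def seg_cost(c, hi):
        # Expected recompute cost sum_{t=c..hi} p[t] * (t - c), i.e. the
        # cost over all hit depths whose deepest checkpoint is c.
        lo_p, lo_tp = (cum_p[c - 1], cum_tp[c - 1]) if c > 0 else (0.0, 0.0)
        return (cum_tp[hi] - lo_tp) - c * (cum_p[hi] - lo_p)

    INF = float("inf")
    # dp[m][c]: min expected cost over depths >= c, given a checkpoint
    # at c and m further checkpoints allowed strictly to the right of c.
    dp = [[INF] * (N + 1) for _ in range(M + 1)]
    choice = [[None] * (N + 1) for _ in range(M + 1)]
    for c in range(N + 1):
        dp[0][c] = seg_cost(c, N)      # no checkpoints left: pay out to N
    for m in range(1, M + 1):
        for c in range(N + 1):
            dp[m][c] = dp[m - 1][c]    # option: leave this budget unused
            for nxt in range(c + 1, N + 1):
                cand = seg_cost(c, nxt - 1) + dp[m - 1][nxt]
                if cand < dp[m][c]:
                    dp[m][c], choice[m][c] = cand, nxt
    # Recover the placement by walking the choices from position 0.
    placement, c, m = [], 0, M
    while m > 0:
        if choice[m][c] is None:
            m -= 1                     # this level placed no checkpoint
            continue
        c = choice[m][c]
        placement.append(c)
        m -= 1
    return placement, dp[M][0]
```

The decomposition into per-segment costs is what makes a checkpoint-placement DP of this shape work: each stored state only affects the recompute cost of hit depths between it and the next checkpoint.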
The method consistently improves on the Pareto frontier traced by standard heuristics on real-world datasets such as QuALITY and System Prompts. Distribution-aware placement dominates every fixed-budget baseline and matches or outperforms the strongest heuristic, block caching, while using substantially fewer checkpoints. The largest gains appear at low checkpoint budgets, where the overlap distribution is most non-uniform. The approach is most relevant when many requests share a substantial but not identical prefix within a retained cache entry. It preserves exact outputs, requires no new recurrent kernels, and can be combined with existing KV-cache compression techniques for hybrid models.
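To illustrate the cache-hit path, the following sketch performs the deepest-checkpoint lookup and exact suffix recompute; step_fn, the checkpoint dict, and init_state are hypothetical stand-ins for the serving stack's recurrent update and state store, not names from the paper:

```python
from bisect import bisect_right

def resume_from_checkpoints(checkpoints, tokens, overlap, step_fn, init_state):
    """Resume exactly from the deepest stored recurrent state.

    checkpoints : dict mapping position -> exact recurrent state there
    tokens      : token ids of the cached prefix
    overlap     : number of leading tokens the new request shares with it
    step_fn     : one recurrent update, state' = step_fn(state, token)
    init_state  : the model's initial recurrent state (checkpoint at 0)

    Returns the exact state after `overlap` tokens plus the number of
    tokens recomputed, touching only the suffix past the checkpoint.
    """
    positions = sorted(checkpoints)
    idx = bisect_right(positions, overlap) - 1   # deepest pos <= overlap
    if idx >= 0:
        start, state = positions[idx], checkpoints[positions[idx]]
    else:
        start, state = 0, init_state
    for t in range(start, overlap):              # exact suffix recompute
        state = step_fn(state, tokens[t])
    return state, overlap - start
```

With checkpoints at positions 128 and 512, for instance, a request matching the first 700 cached tokens resumes from 512 and recomputes only 188 tokens, reproducing the dense path's state exactly; a production system would batch that suffix through the model's scan kernel rather than step token by token.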
- Formalizes sparse prefix caching as an O(NM) dynamic program for optimal checkpoint placement
- Matches or outperforms standard heuristics such as block caching on the QuALITY and System Prompts Pareto frontiers
- Preserves exact outputs, works for recurrent/SSM layers, and combines with KV-cache compression
Why It Matters
Smarter caching for long-context LLMs serving many similar queries, cutting latency and memory use.