RcLLM: New caching system speeds up LLM recommendations by up to 9.5x
Researchers slash recommendation latency with beyond-prefix KV caching and smart sharding.
Large Language Models (LLMs) are increasingly used for generative recommendation — converting user histories and item catalogs into personalized outputs — but deployment lags due to high latency from long, non-reusable prompts. Standard prefix caching fails because user histories and item contexts rarely overlap contiguously. Enter RcLLM: a distributed inference system that introduces Beyond-Prefix KV Caching. It breaks prompts into granular reusable blocks and employs a stratified storage design — compact user-history caches are replicated across nodes for near-zero latency retrieval, while massive item caches are sharded using similarity-aware placement. To minimize redundant quadratic attention, RcLLM combines an affinity-based global scheduler (improving data locality) with a selective attention mechanism that compensates for approximation errors.
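To see why contiguous prefix matching misses reuse that block-level caching can capture, consider the minimal sketch below. It is not RcLLM's implementation: the block labels, `prefix_hits`, and `block_hits` are hypothetical names for illustration, and the code only counts reusable blocks rather than storing real KV tensors.

```python
from typing import List, Set


def prefix_hits(prompt: List[str], cached_prompt: List[str]) -> int:
    """Count leading blocks shared with a cached prompt (prefix caching)."""
    hits = 0
    for new_block, old_block in zip(prompt, cached_prompt):
        if new_block != old_block:
            break  # reuse stops at the first mismatch
        hits += 1
    return hits


def block_hits(prompt: List[str], block_cache: Set[str]) -> int:
    """Count blocks found anywhere in a block-level cache (beyond-prefix)."""
    return sum(1 for block in prompt if block in block_cache)


# Hypothetical prompts, each split into system, user-history, and item blocks.
cached = ["sys", "user:42", "item:A", "item:B"]
new = ["sys", "user:42", "item:C", "item:A", "item:B"]

print(prefix_hits(new, cached))      # 2: reuse ends at the inserted item:C
print(block_hits(new, set(cached)))  # 4: only item:C must be recomputed
```

Reusing a block outside its original position means its cached attention state was computed without the new surrounding context, which is the approximation error the selective attention mechanism is described as compensating for.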
On real-world datasets, RcLLM reduces Time-To-First-Token by 1.31x–9.51x compared to state-of-the-art prefix caching systems, enabling real-time serving of generative recommendation with negligible accuracy loss. The system, accepted at ICDCS 2026, tackles both storage efficiency and compute overhead simultaneously. For engineers building recommendation engines on top of LLMs, this means dramatic latency cuts without retraining models — a practical step toward industrial-scale generative recommendation.
- RcLLM decomposes prompts into reusable blocks for non-contiguous caching, unlike standard prefix caching.
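- Stratified storage: compact user-history caches are replicated across nodes for near-zero-latency retrieval; large item caches are sharded by similarity.
- Cuts TTFT by 1.31x–9.51x versus state-of-the-art prefix caching systems; an affinity-based scheduler improves data locality, and selective attention keeps accuracy loss negligible.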
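The storage and scheduling ideas in these takeaways can be sketched roughly as follows. Everything here is an assumption made for illustration: the cluster ids stand in for whatever similarity measure RcLLM uses, `place_items` and `pick_node` are hypothetical names, and load balancing is ignored. User-history caches are treated as replicated on every node, so only item placement affects the routing decision.

```python
from collections import Counter
from typing import Dict, List


def place_items(item_cluster: Dict[str, int], num_nodes: int) -> Dict[str, int]:
    """Shard item caches so items in the same similarity cluster share a node."""
    return {item: cluster % num_nodes for item, cluster in item_cluster.items()}


def pick_node(request_items: List[str], placement: Dict[str, int], num_nodes: int) -> int:
    """Route a request to the node holding most of its item caches."""
    local = Counter(placement[i] for i in request_items if i in placement)
    # Prefer the highest local count; break ties by the lowest node id.
    return min(range(num_nodes), key=lambda node: (-local[node], node))


# Hypothetical catalog: items A and B are similar (cluster 0), C and D are not.
placement = place_items({"A": 0, "B": 0, "C": 1, "D": 2}, num_nodes=2)

print(placement)                                   # {'A': 0, 'B': 0, 'C': 1, 'D': 0}
print(pick_node(["A", "B", "C"], placement, 2))    # 0: two of three items are local
```

Co-locating similar items raises the chance that a single request finds most of its item blocks on one node, which is the data-locality benefit the affinity-based scheduler is meant to exploit.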
Why It Matters
Enables real-time generative recommendation at scale, bringing LLM-powered personalization closer to production deployment.