Research & Papers

KVServe adaptively compresses the KV cache, cutting LLM time-to-first-token by up to 32.8x

Researchers slash KV cache traffic with adaptive compression that learns service context.

Deep Dive

Disaggregated LLM serving (splitting prefill and decode, or disaggregating KV state) improves scalability but turns the KV cache into a dominant bottleneck as it crosses network and storage boundaries. Existing KV compression methods use static runtime configurations and fail to adapt to changing production service contexts such as workload mix, bandwidth, and SLO/quality budgets. A configuration tuned for one context can therefore underperform in another, or even increase latency.
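To see why a static configuration can backfire, consider a back-of-the-envelope latency model. This is a minimal sketch, not KVServe's actual model: the function `transfer_latency_s`, the cache size, the bandwidths, the compression ratio, and the codec throughput are all hypothetical values chosen for illustration.

```python
def transfer_latency_s(kv_bytes, bandwidth_bps, ratio=1.0, codec_bps=None):
    """Seconds to move a KV cache of kv_bytes over a link (bandwidth in bytes/s).

    ratio is the compression ratio (1.0 means uncompressed). codec_bps is the
    combined compress + decompress throughput in bytes/s of raw KV data;
    None means no codec cost. All parameters are illustrative assumptions.
    """
    wire = kv_bytes / ratio / bandwidth_bps
    codec = 0.0 if codec_bps is None else kv_bytes / codec_bps
    return wire + codec


GB = 1e9
kv = 2 * GB  # hypothetical KV cache size for one request

for name, bw in [("slow link", 1.25 * GB), ("fast link", 50 * GB)]:
    raw = transfer_latency_s(kv, bw)
    packed = transfer_latency_s(kv, bw, ratio=4.0, codec_bps=20 * GB)
    print(f"{name}: raw={raw:.3f}s compressed={packed:.3f}s")
```

Under these assumed numbers, the same 4x compression setting cuts transfer time on the slow link (1.6 s to 0.5 s) but more than doubles it on the fast link (0.04 s to 0.11 s), because the fixed codec cost outweighs the bytes saved.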

KVServe, accepted at SIGCOMM 2026, introduces a modular compression strategy space with cross-method recomposition. Its Bayesian Profiling Engine efficiently searches this space to produce a 3D Pareto candidate set, cutting offline search overhead by 50x. A Service-Aware Online Controller combines an analytical latency model with a lightweight bandit to select optimal profiles under real-time constraints, correcting offline-to-online mismatches. Tested across datasets, models, GPUs, and networks in vLLM, KVServe delivers up to 9.13x job completion time speedup (PD-separated) and 32.8x time-to-first-token reduction (KV-disaggregated).
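The idea of a Pareto candidate set can be sketched in a few lines. The summary does not name the three axes, so this example assumes compression ratio, quality loss, and codec overhead; the `Profile` type, the profile names, and every number are hypothetical. It shows only the Pareto filtering step, not the Bayesian search that KVServe uses to find candidates.

```python
from typing import NamedTuple


class Profile(NamedTuple):
    name: str
    ratio: float         # compression ratio, higher is better (assumed axis)
    quality_loss: float  # accuracy drop, lower is better (assumed axis)
    overhead_ms: float   # codec time, lower is better (assumed axis)


def dominates(a: Profile, b: Profile) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = (a.ratio >= b.ratio
                and a.quality_loss <= b.quality_loss
                and a.overhead_ms <= b.overhead_ms)
    better = (a.ratio > b.ratio
              or a.quality_loss < b.quality_loss
              or a.overhead_ms < b.overhead_ms)
    return no_worse and better


def pareto_front(profiles):
    """Keep only the profiles that no other profile dominates."""
    return [p for p in profiles if not any(dominates(q, p) for q in profiles)]


# Hypothetical measured profiles.
candidates = [
    Profile("none", 1.0, 0.000, 0.0),
    Profile("quant8", 2.0, 0.002, 1.0),
    Profile("quant4", 4.0, 0.010, 1.5),
    Profile("quant4_slow", 4.0, 0.012, 3.0),   # dominated by quant4
    Profile("quant4+entropy", 5.5, 0.010, 4.0),
]
front = pareto_front(candidates)
print([p.name for p in front])
```

Only `quant4_slow` is dropped here, since `quant4` matches or beats it on all three axes. Every surviving profile is the best choice under some trade-off, which is what lets an online controller pick among them as conditions change.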

Key Points
  • Achieves up to 9.13x JCT speedup in PD-separated serving and 32.8x TTFT reduction in KV-disaggregated serving.
  • Reduces offline search overhead by 50x using the Bayesian Profiling Engine.
  • Implemented and evaluated in vLLM; adapts to varying workload, bandwidth, and SLO constraints via a bandit-based online controller.
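The online controller idea, a model-based prediction corrected by a bandit, can be illustrated as follows. The summary does not say which bandit algorithm KVServe uses, so this sketch substitutes plain epsilon-greedy; the `BanditController` class, the profile names, and all latencies are hypothetical.

```python
import random


class BanditController:
    """Epsilon-greedy selector that corrects offline latency predictions online."""

    def __init__(self, predicted_ms, epsilon=0.1, seed=0):
        self.predicted = dict(predicted_ms)          # offline model predictions
        self.bias = {k: 0.0 for k in predicted_ms}   # learned online correction
        self.count = {k: 0 for k in predicted_ms}
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def estimate(self, name):
        return self.predicted[name] + self.bias[name]

    def select(self):
        # Explore occasionally; otherwise pick the lowest corrected estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(sorted(self.predicted))
        return min(self.predicted, key=self.estimate)

    def update(self, name, observed_ms):
        # Running mean of (observed - predicted) for the chosen profile.
        self.count[name] += 1
        err = observed_ms - self.predicted[name]
        self.bias[name] += (err - self.bias[name]) / self.count[name]


# Hypothetical scenario: the offline model is too optimistic about quant4.
predicted = {"quant4": 50.0, "quant8": 60.0}
actual = {"quant4": 90.0, "quant8": 62.0}

controller = BanditController(predicted)
for _ in range(200):
    choice = controller.select()
    controller.update(choice, actual[choice])

best = min(predicted, key=controller.estimate)
print(best, controller.estimate("quant4"), controller.estimate("quant8"))
```

The controller starts out preferring `quant4` because the offline prediction is lower, then switches to `quant8` once observed latencies reveal the mismatch. A production controller would also need to handle noisy measurements and context that shifts over time.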

Why It Matters

KVServe targets the KV cache bottleneck in disaggregated LLM serving, which could make AI inference faster, cheaper, and more adaptive.