CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
New framework rethinks KV cache restoration as a parallel execution problem.
CacheFlow, developed by Sean Nian, Jiahao Fang, Qilong Feng, Zhiyu Wu, and Fan Lai, addresses a critical bottleneck in serving long-context LLMs: KV cache restoration. As LLMs handle tasks like multi-turn conversations, RAG, and agentic pipelines, rebuilding the KV cache, whether by recomputing it from scratch or reloading it from offloaded storage, has become a dominant latency factor. Existing methods treat this as a per-request tradeoff between recomputation and I/O, missing opportunities for parallelism across tokens, layers, and distributed deployments, and ignoring resource contention in batched serving.
CacheFlow reimagines restoration as a multi-dimensional parallel execution problem. It introduces a unified 3D parallelism abstraction that enables fine-grained overlap of recomputation and I/O along transformer inference dependencies. At its core is a batch-aware two-pointer scheduler that jointly optimizes compute and I/O allocation across requests, prioritizing operations with the highest marginal reduction in recomputation cost. Evaluations show CacheFlow reduces Time-To-First-Token (TTFT) by 10%-62% across diverse models, workloads, and hardware, making it a significant advance for latency-sensitive LLM applications.
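To make the overlap idea concrete, here is a minimal sketch of the layer dimension of that parallelism: while the GPU processes layer k, the I/O path prefetches layer k+1's KV blocks from storage. The function names, timings, and single-threaded I/O pool are illustrative assumptions, not CacheFlow's actual implementation, and the token and GPU dimensions are omitted.

```python
import concurrent.futures as cf
import time

def load_layer_kv(layer):
    # Hypothetical I/O step: fetch one layer's offloaded KV blocks.
    time.sleep(0.01)
    return f"kv[{layer}]"

def recompute_layer(layer, kv):
    # Hypothetical compute step: run one transformer layer's restoration work.
    time.sleep(0.01)
    return f"out[{layer}]"

def restore_pipelined(num_layers):
    """Overlap I/O for layer k+1 with recomputation of layer k."""
    results = []
    with cf.ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load_layer_kv, 0)
        for layer in range(num_layers):
            kv = pending.result()           # wait for this layer's KV to arrive
            if layer + 1 < num_layers:
                # Prefetch the next layer's KV while we compute this one.
                pending = io_pool.submit(load_layer_kv, layer + 1)
            results.append(recompute_layer(layer, kv))
    return results
```

With per-layer compute and I/O each taking roughly the same time, this pipelining hides nearly all of the I/O latency behind compute after the first layer's load.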
- CacheFlow reduces Time-To-First-Token (TTFT) by 10%-62% over existing KV cache restoration methods.
- It introduces a unified 3D parallelism abstraction across tokens, layers, and GPUs to overlap recomputation and I/O.
- A batch-aware two-pointer scheduler optimizes compute and I/O allocation by prioritizing operations with the highest marginal recomputation cost reduction.
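The two-pointer idea in the last bullet can be sketched as a greedy split of restorable KV chunks between the compute and I/O streams. Everything below is an illustrative assumption: the `Chunk` cost model, the ratio-based sort order, and the balancing rule are one plausible reading of a two-pointer scheduler, not CacheFlow's published algorithm.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    # Hypothetical per-chunk cost estimates (seconds).
    req_id: int
    recompute_cost: float  # time to recompute this KV chunk on the GPU
    io_cost: float         # time to load it from offloaded storage

def two_pointer_schedule(chunks):
    """Greedy two-pointer split between recomputation and I/O.

    Chunks are sorted so those with the highest I/O-to-recompute cost ratio
    (cheapest to recompute relative to loading) sit at the front. A left
    pointer feeds the recompute stream and a right pointer feeds the I/O
    stream; at each step the stream that is currently ahead waits while the
    other takes the next chunk, so both streams finish at roughly the same
    time and their work fully overlaps.
    """
    order = sorted(chunks, key=lambda c: c.io_cost / c.recompute_cost, reverse=True)
    lo, hi = 0, len(order) - 1
    compute_plan, io_plan = [], []
    compute_t = io_t = 0.0
    while lo <= hi:
        if compute_t <= io_t:
            compute_plan.append(order[lo])
            compute_t += order[lo].recompute_cost
            lo += 1
        else:
            io_plan.append(order[hi])
            io_t += order[hi].io_cost
            hi -= 1
    return compute_plan, io_plan
```

Because the scheduler sees all chunks in the batch at once rather than one request at a time, a chunk from one request can be routed to I/O while another request's chunk occupies the GPU, which is the batch-aware behavior the bullet describes.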
Why It Matters
Faster KV cache restoration directly speeds up long-context LLM tasks like chatbots and RAG pipelines.