Research & Papers

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

New framework rethinks KV cache restoration as a parallel execution problem.

Deep Dive

CacheFlow, developed by Sean Nian, Jiahao Fang, Qilong Feng, Zhiyu Wu, and Fan Lai, addresses a critical bottleneck in serving long-context LLMs: KV cache restoration. As LLMs take on multi-turn conversations, RAG, and agentic pipelines, restoring the KV cache, whether by recomputing it from scratch or loading it from offloaded storage, has become a dominant source of latency. Existing methods treat restoration as a per-request tradeoff between recomputation and I/O, missing opportunities for parallelism across tokens, layers, and distributed deployments, and ignoring resource contention in batched serving.

CacheFlow reimagines restoration as a multi-dimensional parallel execution problem. It introduces a unified 3D parallelism abstraction that enables fine-grained overlap of recomputation and I/O along transformer inference dependencies. At its core is a batch-aware two-pointer scheduler that jointly optimizes compute and I/O allocation across requests, prioritizing the operations that yield the largest marginal reduction in recomputation cost. Evaluations show CacheFlow reduces Time-To-First-Token (TTFT) by 10%-62% across diverse models, workloads, and hardware, a significant advance for latency-sensitive LLM applications.
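The write-up does not spell out the scheduler's internals, but the description suggests something like the following minimal Python sketch. Everything in it is a hypothetical reconstruction, not CacheFlow's actual algorithm: the Chunk record, the per-chunk cost estimates, and the ratio-based ordering are all assumptions. The idea it illustrates is that spans whose KV is expensive to recompute relative to loading are routed to I/O first, while two pointers keep the compute and I/O queues balanced.

    from dataclasses import dataclass

    @dataclass
    class Chunk:
        """A contiguous span of one request's KV cache (hypothetical unit)."""
        request_id: int
        recompute_ms: float  # estimated GPU time to recompute this span's KV
        load_ms: float       # estimated time to fetch it from offloaded storage

    def two_pointer_schedule(chunks: list[Chunk]) -> tuple[list[Chunk], list[Chunk]]:
        """Split chunks between I/O and recomputation so the two resources
        stay balanced. Chunks are ordered by recompute_ms / load_ms, so the
        front of the list holds the spans whose loading saves the most
        recomputation per unit of bandwidth (load_ms assumed positive)."""
        order = sorted(chunks, key=lambda c: c.recompute_ms / c.load_ms, reverse=True)
        lo, hi = 0, len(order) - 1
        io_time = compute_time = 0.0
        io_plan: list[Chunk] = []
        compute_plan: list[Chunk] = []
        while lo <= hi:
            if io_time <= compute_time:
                # I/O is the less loaded resource: give it the chunk with the
                # highest marginal reduction in recomputation cost.
                io_plan.append(order[lo])
                io_time += order[lo].load_ms
                lo += 1
            else:
                # Compute is the less loaded resource: give it the chunk that
                # is cheapest to recompute relative to loading.
                compute_plan.append(order[hi])
                compute_time += order[hi].recompute_ms
                hi -= 1
        return io_plan, compute_plan

    # Toy usage: the first chunk is far cheaper to load than to recompute,
    # the second is the opposite.
    io_plan, compute_plan = two_pointer_schedule([
        Chunk(0, recompute_ms=8.0, load_ms=2.0),
        Chunk(1, recompute_ms=1.0, load_ms=4.0),
    ])

The greedy rule here is one plausible reading of "highest marginal reduction in recomputation cost"; the paper's scheduler may weigh batching effects and distributed placement that this toy version ignores.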

Key Points
  • CacheFlow reduces TTFT by 10%-62% over existing KV cache restoration methods.
  • It introduces a unified 3D parallelism abstraction across tokens, layers, and GPUs to overlap recomputation and I/O (a toy sketch of the layer dimension follows this list).
  • A batch-aware two-pointer scheduler optimizes compute and I/O allocation, prioritizing the operations that yield the largest marginal reduction in recomputation cost.
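To make the overlap idea concrete, here is a self-contained toy sketch of just the layer dimension of that 3D abstraction. The helper names and the sleep-based cost model are stand-ins, not CacheFlow's API: loads are issued eagerly on a background pool so they proceed while earlier layers are recomputed, mirroring the layer-by-layer dependency of transformer prefill.

    import concurrent.futures as cf
    import time

    NUM_LAYERS = 4

    def load_kv_from_storage(layer: int) -> str:
        time.sleep(0.05)  # stand-in for an SSD / host-memory read
        return f"kv[{layer}] (loaded)"

    def recompute_kv(layer: int, hidden: str) -> str:
        time.sleep(0.02)  # stand-in for GPU prefill of one layer
        return f"kv[{layer}] (recomputed from {hidden})"

    def restore_kv(plan: dict[int, str]) -> list:
        """Restore per-layer KV, overlapping I/O with recomputation.
        `plan` maps layer index -> "load" or "recompute"."""
        kv: list = [None] * NUM_LAYERS
        with cf.ThreadPoolExecutor(max_workers=2) as io_pool:
            # Issue every load up front: storage reads carry no layer
            # dependency, so they run while earlier layers recompute.
            pending = {layer: io_pool.submit(load_kv_from_storage, layer)
                       for layer, action in plan.items() if action == "load"}
            hidden = "h0"  # stand-in for the prompt's embeddings
            for layer in range(NUM_LAYERS):
                if layer in pending:
                    kv[layer] = pending[layer].result()  # blocks only if I/O lags
                else:
                    kv[layer] = recompute_kv(layer, hidden)
                hidden = f"h{layer + 1}"  # prefill proceeds layer by layer
        return kv

    print(restore_kv({1: "load", 3: "load"}))

The real system presumably overlaps along token and GPU dimensions as well and uses CUDA streams rather than Python threads; this sketch only shows why issuing I/O ahead of the compute frontier hides load latency behind recomputation.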

Why It Matters

Faster KV cache restoration directly cuts response latency for long-context LLM applications such as multi-turn chatbots and RAG pipelines.