Research & Papers

SiDP paper frees GPU memory for up to 1.8x more KV cache capacity in LLM inference

Treats model weights as a shared resource across GPUs to cut memory waste.

Deep Dive

The rapid shift of LLM inference to offline, throughput-oriented workloads has exposed a fundamental tension in GPU deployment. Traditional data parallelism (DP) replicates the full model weights on every GPU, leaving little memory for the key-value (KV) cache, which is the primary bottleneck for large batch sizes. Model parallelism saves memory but requires fine-grained synchronization that limits scheduling flexibility. SiDP, introduced by Alan Zhao and Cyril Y. He, addresses this by treating model weights as a bandwidth-backed shared resource within a DP group. Each layer is owned by a single GPU, and the other replicas access its weights on demand through two complementary modes. Weight-as-a-Service (WaS) serves large batches by streaming remote weights over NVLink into a small local cache. Compute-as-a-Service (CaS) serves small batches by shipping activations to the GPU that owns the weights.
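
The split between the two modes can be pictured with a simple byte-count heuristic. The sketch below is an illustration only and is not the paper's scheduling policy: the names LayerShape, activation_round_trip_bytes, and choose_mode are hypothetical, and the rule (pick whichever mode moves fewer bytes across the interconnect) is an assumption made here to show why small batches would favor CaS and large batches would favor WaS. It also assumes the layer output has the same width as its input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerShape:
    """Hypothetical description of one layer, for illustration only."""
    weight_bytes: int      # size of the layer's weights on the owning GPU
    hidden_size: int       # activation width per token
    dtype_bytes: int = 2   # e.g. 2 for bf16/fp16 activations


def activation_round_trip_bytes(layer: LayerShape, batch_tokens: int) -> int:
    # Under CaS, activations travel to the weight owner and results come back.
    # Assumption: output activations have the same width as the inputs.
    return 2 * batch_tokens * layer.hidden_size * layer.dtype_bytes


def choose_mode(layer: LayerShape, batch_tokens: int) -> str:
    """Return "CaS" when moving activations is cheaper than moving weights.

    Assumed heuristic: compare bytes transferred, ignoring caching,
    overlap with compute, and load on the owning GPU.
    """
    if activation_round_trip_bytes(layer, batch_tokens) < layer.weight_bytes:
        return "CaS"
    return "WaS"


if __name__ == "__main__":
    layer = LayerShape(weight_bytes=1_000_000_000, hidden_size=8192)
    for tokens in (64, 65_536):
        print(tokens, choose_mode(layer, tokens))
```

With these made-up sizes, a 64-token batch moves far fewer bytes as activations than the 1 GB of weights, so the heuristic picks CaS, while a 65,536-token batch crosses the break-even point and picks WaS. A real system would also need to account for the weight cache, transfer/compute overlap, and contention on the owner.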

Evaluation on NVIDIA H20, H200, and B200 GPUs with Qwen3-32B, Qwen2.5-72B, and Llama-3.1-70B shows SiDP increases usable KV capacity by up to 1.8x under identical configurations, translating to up to 1.5x higher end-to-end throughput over the vLLM baseline for offline workloads. The approach requires no changes to model architecture, making it a drop-in optimization for existing deployment stacks. By decoupling weight storage from compute, SiDP opens the door to more aggressive batching without expensive hardware upgrades.
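
To see where extra KV capacity can come from, a back-of-the-envelope memory budget helps. The sketch below is not taken from the paper: the function kv_capacity_gb, the 96 GB GPU, the 64 GB of weights, the group size of 2, and the 4 GB weight cache are all hypothetical values chosen for illustration, and the model ignores activation memory, fragmentation, and runtime overhead.

```python
def kv_capacity_gb(gpu_mem_gb: float,
                   weights_gb: float,
                   group_size: int = 1,
                   weight_cache_gb: float = 0.0) -> float:
    """Memory left for the KV cache on one GPU (simplified, hypothetical model).

    group_size=1 models plain data parallelism (full replica per GPU).
    group_size>1 models weights split evenly across the group, plus a
    small local cache for weights streamed in from other GPUs.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    owned_weights_gb = weights_gb / group_size
    free_gb = gpu_mem_gb - owned_weights_gb - weight_cache_gb
    if free_gb < 0:
        raise ValueError("weights and cache do not fit in GPU memory")
    return free_gb


if __name__ == "__main__":
    # Illustrative numbers only; not measurements from the paper.
    replicated = kv_capacity_gb(gpu_mem_gb=96, weights_gb=64)
    shared = kv_capacity_gb(gpu_mem_gb=96, weights_gb=64,
                            group_size=2, weight_cache_gb=4)
    print(f"replicated: {replicated} GB, shared: {shared} GB, "
          f"ratio: {shared / replicated:.3f}x")
```

In this toy budget, sharing weights across two GPUs raises per-GPU KV space from 32 GB to 60 GB. The actual gain in any deployment depends on the model size, the group size, and how much memory the weight cache and activations consume.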

Key Points
  • SiDP replaces full model replication with a distributed weight pool, each layer owned by one GPU.
  • Two execution modes: Weight-as-a-Service (streams weights) and Compute-as-a-Service (ships activations).
  • Up to 1.8x more usable KV cache capacity and up to 1.5x higher end-to-end throughput than the vLLM baseline on Qwen and Llama models.

Why It Matters

Enables larger batch sizes on existing GPU clusters without costly hardware upgrades, which can lower inference cost for offline workloads.