PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
A new system from researchers lets an LLM be re-split across GPUs on the fly, without stopping service, improving time-to-first-token by up to 55%.
A research team led by Xu Bai has introduced PipeLive, a novel system designed to remove a critical bottleneck in large language model (LLM) serving: the inability to reconfigure how a model is split across GPUs without taking the service down. Current systems use static pipeline parallelism (PP), in which model layers are pinned to specific GPUs. That rigidity fails in dynamic environments such as serverless platforms or heterogeneous hardware, because stopping and redeploying the model to change the split incurs prohibitive delays. PipeLive instead enables 'live in-place' reconfiguration: the allocation of model layers across GPUs can be changed on the fly while the model continues to generate text for users.
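To make the idea concrete, here is a minimal Python sketch of a pipeline-parallel layer placement and of computing which layers must move when the split is rebalanced live. The `StageSpec` and `plan_migration` names are illustrative only, not PipeLive's actual API.

```python
# Minimal sketch (not PipeLive's code): a PP placement maps each GPU rank to a
# contiguous range of model layers; a live rebalance changes those ranges and
# we compute which layers each GPU must newly receive.
from dataclasses import dataclass

@dataclass(frozen=True)
class StageSpec:
    gpu_rank: int
    layer_start: int   # inclusive
    layer_end: int     # exclusive

def plan_migration(old: list[StageSpec], new: list[StageSpec]) -> dict[int, list[int]]:
    """Return, per GPU rank, the layers it must receive under the new plan."""
    old_owner = {l: s.gpu_rank for s in old for l in range(s.layer_start, s.layer_end)}
    moves: dict[int, list[int]] = {}
    for s in new:
        for layer in range(s.layer_start, s.layer_end):
            if old_owner.get(layer) != s.gpu_rank:
                moves.setdefault(s.gpu_rank, []).append(layer)
    return moves

# Example: shift two layers from GPU 0 to GPU 1 while decoding continues.
old_plan = [StageSpec(0, 0, 16), StageSpec(1, 16, 32)]
new_plan = [StageSpec(0, 0, 14), StageSpec(1, 14, 32)]
print(plan_migration(old_plan, new_plan))   # {1: [14, 15]}
```

The hard part, of course, is not computing this plan but carrying it out while requests are in flight, which is what the KV cache machinery below addresses.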
The core technical innovation is a two-part mechanism. First, PipeLive redesigns the Key-Value (KV) cache memory layout and extends the PagedAttention mechanism (used in systems such as vLLM) to allow live resizing of the cache, which is essential for changing layer placements on memory-saturated GPUs. Second, it adopts an incremental 'KV patching' technique, inspired by live virtual machine migration, to safely synchronize the evolving KV cache state between the old and new GPU configurations and to identify a safe switch point with minimal disruption. The results are significant: reconfiguration overhead drops from seconds to under 10 milliseconds, while time-to-first-token (TTFT) and time-per-output-token (TPOT) improve by up to 54.7% and 14.7%, respectively, compared with systems that lack this capability.
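The article only describes KV patching at a high level; the sketch below shows the pre-copy pattern it borrows from live VM migration, under the assumption that the serving engine tracks which KV blocks are written between copy passes. All names here (`live_kv_patch`, `send_block`, and so on) are hypothetical, not taken from the paper.

```python
# Minimal sketch of incremental KV patching in the spirit of pre-copy live VM
# migration. Blocks are copied to the new placement while decoding continues;
# only blocks written since the last pass are re-sent, and the switch happens
# once the remaining delta is small enough to patch during a brief pause.
def live_kv_patch(kv_blocks: dict[int, bytes],
                  dirty_since_last_pass: set[int],   # filled by the decode loop
                  send_block,                         # callable(block_id, payload)
                  max_final_delta: int = 8,
                  max_passes: int = 5) -> None:
    to_send = set(kv_blocks)                     # first pass: copy everything
    for _ in range(max_passes):
        for block_id in to_send:
            send_block(block_id, kv_blocks[block_id])
        to_send = set(dirty_since_last_pass)     # blocks mutated meanwhile
        dirty_since_last_pass.clear()
        if len(to_send) <= max_final_delta:      # safe switch point reached
            break
    # brief pause: patch the last few dirty blocks, then flip traffic
    for block_id in to_send:
        send_block(block_id, kv_blocks[block_id])

# Tiny demo: no writes occur during the copy, so a single pass suffices.
received = {}
live_kv_patch({0: b"k0v0", 1: b"k1v1"}, set(), lambda b, p: received.update({b: p}))
```

The point of the pre-copy structure is that the final pause is bounded by the size of the last delta rather than by the whole cache, which is what makes millisecond-scale switchover plausible.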
- Enables live, in-place reconfiguration of pipeline parallelism for LLMs, reducing overhead from seconds to under 10ms.
- Uses a redesigned KV cache layout with an extended PagedAttention mechanism and incremental KV patching for safe state synchronization (see the sketch after this list).
- Improves key performance metrics: reduces TTFT by up to 54.7% and TPOT by up to 14.7% compared to static systems.
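As referenced above, the following is a minimal, hypothetical sketch of the bookkeeping behind a live-resizable paged KV block pool. It is not vLLM's or PipeLive's actual PagedAttention code; it only illustrates the primitive PipeLive needs, namely growing or shrinking a GPU's KV pool without a restart so that layers can be taken on or given up in place.

```python
# Minimal sketch: a paged KV block pool whose capacity can change while
# requests are in flight. Block contents are elided; only free-list
# bookkeeping is shown. All names are illustrative.
class ResizablePagedKVPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.capacity = num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV pool exhausted")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

    def resize(self, new_capacity: int) -> None:
        if new_capacity > self.capacity:          # grow: add fresh block ids
            self.free.extend(range(self.capacity, new_capacity))
        else:                                     # shrink: reclaim only free blocks
            keep = [b for b in self.free if b < new_capacity]
            reclaimed = len(self.free) - len(keep)
            if reclaimed < self.capacity - new_capacity:
                # a real engine would first migrate the in-use high blocks
                raise MemoryError("cannot shrink: high block ids still in use")
            self.free = keep
        self.capacity = new_capacity
```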
Why It Matters
This enables truly elastic and efficient LLM serving in cloud and serverless environments, adapting to load and hardware without interrupting users.