AlignedServe boosts LLM serving throughput by up to 1.98x with KV-cache-aligned batching
New framework reduces iteration-level bubbles and cuts latency by up to 7.4x.
A team of researchers (Bai et al.) from Sun Yat-sen University has released AlignedServe, a new LLM serving framework that tackles a subtle but costly inefficiency in current inference systems: iteration-level bubbles. Standard batching treats all requests in a batch equally, but the tokens generated in a single decode iteration depend on KV caches of different lengths. Requests with long caches take longer to process, so the rest of the batch sits idle until they finish. AlignedServe addresses this by grouping requests with similar KV-cache lengths into the same batch, so that all tokens in a batch finish decoding at roughly the same time.
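The grouping idea can be illustrated with a toy model. The sketch below is not AlignedServe's scheduler. It assumes a simplified cost model in which each request's decode work is proportional to its KV-cache length and every request in a batch waits for the longest one. The names `Request`, `fifo_batches`, `aligned_batches`, and `bubble` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    req_id: int
    kv_len: int  # number of cached tokens this request attends over


def fifo_batches(pool, batch_size):
    """Baseline: form batches in arrival order, ignoring KV-cache length."""
    return [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]


def aligned_batches(pool, batch_size):
    """Sort by KV-cache length so each batch holds requests of similar length."""
    ordered = sorted(pool, key=lambda r: r.kv_len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def bubble(batches):
    """Toy idle-work estimate: each request waits for the longest in its batch."""
    return sum(max(r.kv_len for r in b) - r.kv_len for b in batches for r in b)


# A mixed pool: short chat-style requests interleaved with long-context ones.
lengths = [4000, 100, 3900, 120, 4100, 90, 3800, 110]
pool = [Request(i, n) for i, n in enumerate(lengths)]

fifo = fifo_batches(pool, batch_size=4)
aligned = aligned_batches(pool, batch_size=4)

print("FIFO bubble:   ", bubble(fifo))
print("Aligned bubble:", bubble(aligned))
```

In this toy pool, arrival-order batching mixes short and long requests in every batch, while length-sorted batching separates them, so the estimated idle work drops sharply. A real scheduler would also have to account for fairness, arrival times, and memory limits, none of which this sketch models.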
To make this grouping practical, AlignedServe uses large CPU memory to maintain a deep pool of in-flight requests, allowing flexible batch formation. It also introduces a GPU-Prefetch-For-GPU architecture where one GPU preloads KV caches for another, nearly eliminating CPU-to-GPU transfer latency. In experiments with synthetic and real-world workloads, AlignedServe improved decoding throughput by up to 1.98x and reduced latency by up to 7.4x over systems like vLLM. The approach is particularly effective for long-context applications (e.g., document QA, code generation), where KV-cache lengths vary widely.
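Why prefetching helps can be seen in a generic overlap model. The sketch below is only an assumption-laden illustration, not the GPU-Prefetch-For-GPU design itself: it assumes each batch's KV cache must be loaded before decoding, that a helper can load the next batch while the current one decodes, and that there is room for one batch of lookahead. The function names and the time units are hypothetical.

```python
def serial_time(load, compute):
    """No overlap: each batch's KV cache is loaded, then the batch is decoded."""
    return sum(l + c for l, c in zip(load, compute))


def prefetched_time(load, compute):
    """Loading batch i+1 overlaps with decoding batch i (one batch of lookahead)."""
    if not compute:
        return 0
    ready = load[0]  # time at which batch 0's KV cache is resident
    done = 0         # time at which the previous batch finished decoding
    for i, c in enumerate(compute):
        start = max(ready, done)  # need both the data and a free compute GPU
        done = start + c
        if i + 1 < len(load):
            # The next load begins as soon as this batch starts decoding.
            ready = start + load[i + 1]
    return done


# Four batches, each taking 3 time units to load and 5 to decode.
load = [3, 3, 3, 3]
compute = [5, 5, 5, 5]

print("serial:    ", serial_time(load, compute))
print("prefetched:", prefetched_time(load, compute))
```

When decoding takes at least as long as loading, only the first load remains on the critical path in this model, which is the sense in which prefetching can hide most of the transfer latency.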
- AlignedServe groups requests by KV-cache length to reduce iteration-level bubbles, improving decoding throughput by up to 1.98x over systems like vLLM.
- Novel GPU-Prefetch-For-GPU architecture lowers CPU-to-GPU transfer latency, enabling efficient batching with large request pools.
- Latency reductions of up to 7.4x on application workloads, especially beneficial for long-context LLM serving.
Why It Matters
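Higher decoding throughput on the same hardware means a lower cost per generated token. If these gains carry over to production, they could cut LLM inference costs and make long-context applications more practical and affordable.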