Research & Papers

StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving

The system combines disaggregated prefill-decode serving with runtime-adaptive speculation to reach a peak of 2,235 tokens/sec on summarization.

Deep Dive

A team of researchers including Satyam Kumar and Saurabh Jha has published a paper on StreamServe, a new system designed to dramatically improve the efficiency of large language model (LLM) inference. The core innovation is a "disaggregated prefill-decode serving architecture" that separates the initial prompt processing (prefill) from the token generation (decode) stages, allowing them to be optimized independently. This framework combines two key techniques: metric-aware routing to intelligently distribute requests across available GPU compute lanes, and adaptive speculative decoding, which dynamically adjusts how many future tokens to predict (the speculation depth) based on real-time performance signals.
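To make the adaptive-speculation idea concrete, here is a minimal Python sketch of a depth controller that widens or narrows speculation based on runtime signals. The signal names, thresholds, and function names are illustrative assumptions for exposition, not StreamServe's actual interface.

```python
# Minimal sketch of runtime-adaptive speculation depth (illustrative only;
# the signals and thresholds below are assumptions, not StreamServe's API).

from dataclasses import dataclass

@dataclass
class DecodeSignals:
    acceptance_rate: float   # fraction of speculated tokens accepted last step
    queue_depth: int         # requests waiting on the decode stream
    gpu_utilization: float   # 0.0 - 1.0

def adapt_speculation_depth(depth: int, sig: DecodeSignals,
                            min_depth: int = 1, max_depth: int = 8) -> int:
    """Raise depth when speculation is paying off and the GPU has headroom;
    lower it when drafts are being rejected or the decode lane is congested."""
    if sig.acceptance_rate > 0.8 and sig.gpu_utilization < 0.9:
        depth += 1          # drafts mostly accepted: speculate further ahead
    elif sig.acceptance_rate < 0.5 or sig.queue_depth > 16:
        depth -= 1          # wasted draft work or congestion: back off
    return max(min_depth, min(depth, max_depth))
```

The key property such a controller aims for is that speculation depth tracks how useful speculation currently is, rather than being fixed at deployment time.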

StreamServe is built from four coordinated components: the StreamScheduler for request orchestration, FlowGuard for multi-signal routing decisions, the PipeServe Engine that executes the disaggregated prefill and decode operations on multiple GPUs, and SpecuStream, which handles the runtime adaptive speculation. The team evaluated the system using four benchmarks—ALPACA, GSM8K, HUMANEVAL, and SUM—with a total of 320 queries, running on a setup of four NVIDIA A800 40GB GPUs configured as two stream pairs.
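As an illustration of what metric-aware routing across stream pairs might look like, the sketch below scores each pair by its current prefill backlog and decode pressure and sends a new request to the least-loaded one. The class and function names, signals, and weights are assumptions made for this example; FlowGuard's actual multi-signal decision logic is likely richer.

```python
# Hypothetical sketch of metric-aware routing across disaggregated stream pairs.
# StreamPair, route_request, and the scoring weights are illustrative assumptions.

from __future__ import annotations
from dataclasses import dataclass

@dataclass
class StreamPair:
    name: str
    prefill_queue_tokens: int = 0   # tokens waiting in the prefill stage
    decode_active_seqs: int = 0     # sequences currently generating

def route_request(prompt_tokens: int, pairs: list[StreamPair]) -> StreamPair:
    """Pick the stream pair with the lowest combined load estimate."""
    def score(p: StreamPair) -> float:
        # Weight the prefill backlog plus the incoming prompt, then add decode pressure.
        return (p.prefill_queue_tokens + prompt_tokens) + p.decode_active_seqs * 50.0
    best = min(pairs, key=score)
    best.prefill_queue_tokens += prompt_tokens   # account for the newly routed work
    return best

# Example: two stream pairs, mirroring the two-pair A800 evaluation setup.
pairs = [StreamPair("pair-0"), StreamPair("pair-1")]
target = route_request(prompt_tokens=512, pairs=pairs)
print(f"routed to {target.name}")
```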

The results are striking. Compared to standard tensor-parallel baselines such as vLLM, StreamServe reduced latency by a factor of 11 to 18. It also reached a peak throughput of 2,235 tokens per second on summarization tasks. Crucially, the time per output token remained stable across configurations, indicating that the speed gains come from architectural efficiency rather than from degrading the quality of the generated text. While the evaluation used a single 4-GPU node, the paper argues that jointly optimizing routing and speculation within this disaggregated framework opens a new, high-performance operating regime for low-latency LLM serving.

Key Points
  • Reduced inference latency by 11-18x compared to vLLM baselines on 4 A800 GPUs.
  • Achieved throughput up to 2,235 tokens/second on summarization benchmarks.
  • Uses adaptive speculative decoding to tune prediction depth in real-time for efficiency.

Why It Matters

This could enable much faster and more responsive AI applications, from chatbots to coding assistants, at a lower computational cost.