Research & Papers

Multi-stage Flow Scheduling for LLM Serving

New scheduling algorithm cuts AI wait times by prioritizing critical network traffic between GPUs.

Deep Dive

A research team from Hong Kong University of Science and Technology, Shanghai Jiao Tong University, and the University of Science and Technology of China has published a paper introducing MFS (Multi-stage Flow Scheduling), a mechanism designed to substantially improve the responsiveness of large language model (LLM) serving systems. The core challenge is meeting stringent Time-To-First-Token (TTFT) requirements (the delay before a user sees the AI's first word), which is critical for user experience. Modern, high-efficiency LLM serving systems use disaggregated architectures with complex parallelism strategies, creating multi-stage workflows in which data (such as reusable KV-blocks) must move between GPUs over the network. These transfers form traffic 'flows' that contend for bandwidth, and existing stage-agnostic schedulers often cause uncoordinated network congestion that violates service-level objectives (SLOs).
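
To make the contention problem concrete, the toy sketch below models a request whose remaining KV-block transfer flows determine how much slack it has left before its TTFT deadline. The class and field names (Request, StageFlow, laxity) are illustrative placeholders, not taken from the paper or from vLLM.

```python
# Illustrative sketch only: a toy model of a multi-stage request whose
# network flows (e.g. KV-block transfers) must finish before its TTFT
# deadline. Names and fields are hypothetical, not the paper's API.
from dataclasses import dataclass, field


@dataclass
class StageFlow:
    """One network flow in a multi-stage serving pipeline, e.g. a KV-block transfer."""
    size_bytes: int
    est_duration_s: float  # estimated transfer time on an uncontended link


@dataclass
class Request:
    arrival_s: float
    ttft_deadline_s: float  # absolute deadline for the first token (the TTFT SLO)
    remaining_flows: list[StageFlow] = field(default_factory=list)

    def laxity(self, now: float) -> float:
        """Slack before the TTFT deadline if the remaining flows ran
        back-to-back without contention; negative means the SLO is at risk."""
        remaining_work = sum(f.est_duration_s for f in self.remaining_flows)
        return self.ttft_deadline_s - now - remaining_work
```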

MFS tackles this by approximating a Least-Laxity-First scheduling policy without needing precise knowledge of a request's remaining time slack. Its 'Defer-and-Promote' principle, implemented through a Reverse Multi-Level Queue (RMLQ) structure, dynamically raises the network priority of tasks as their effective laxity (the slack left before the deadline) shrinks. This ensures that flows critical for TTFT get priority on bottleneck links, while requests with looser deadlines do not prematurely hog bandwidth. The researchers implemented MFS as a pluggable module in the popular vLLM serving framework and evaluated it on an 8-server, 32-GPU testbed alongside large-scale simulations. The results show MFS outperforming state-of-the-art baselines, improving TTFT SLO attainment by 1.2x to 2.4x, meaning far more requests meet their target response-time guarantees under heavy load.
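
The sketch below illustrates the Defer-and-Promote idea with a small Reverse Multi-Level Queue, reusing the toy Request.laxity estimate from the earlier sketch. Note that the paper's point is that MFS avoids relying on precise laxity values, so the explicit laxity calls and the numeric thresholds here are simplifying assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch (assumptions noted above) of Defer-and-Promote over a
# Reverse Multi-Level Queue: flows start at the lowest network priority and
# are only promoted upward as their estimated laxity shrinks.
from collections import deque


class ReverseMLQ:
    def __init__(self, laxity_thresholds_s=(0.5, 2.0)):
        # levels[0] is the highest network priority, levels[-1] the lowest.
        # Thresholds are ascending: tighter slack maps to a higher priority.
        self.thresholds = laxity_thresholds_s
        self.levels = [deque() for _ in range(len(laxity_thresholds_s) + 1)]

    def _level_for(self, laxity_s: float) -> int:
        # Smaller slack -> lower index -> higher priority.
        for i, threshold in enumerate(self.thresholds):
            if laxity_s < threshold:
                return i
        return len(self.levels) - 1

    def admit(self, request, now: float) -> None:
        # Defer: a request far from its deadline starts at a low priority,
        # so it does not prematurely hog the bottleneck link.
        self.levels[self._level_for(request.laxity(now))].append(request)

    def promote(self, now: float) -> None:
        # Promote: periodically re-estimate laxity and move requests whose
        # slack has shrunk into higher-priority levels (never demote).
        for lvl in range(len(self.levels) - 1, 0, -1):
            keep = deque()
            while self.levels[lvl]:
                req = self.levels[lvl].popleft()
                target = self._level_for(req.laxity(now))
                (self.levels[target] if target < lvl else keep).append(req)
            self.levels[lvl] = keep

    def next_flow(self):
        # Serve the highest-priority non-empty level first, so flows that
        # are critical for TTFT win the contended link.
        for queue in self.levels:
            if queue:
                return queue.popleft()
        return None
```

In this toy model, a scheduler would call promote() on a timer and next_flow() whenever the link frees up; how MFS actually integrates these decisions with vLLM's data-transfer path is not modeled here.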

Key Points
  • Improves TTFT SLO attainment by 1.2x to 2.4x over existing schedulers by reducing network contention in multi-GPU clusters.
  • Uses a novel 'Defer-and-Promote' principle with a Reverse Multi-Level Queue to dynamically prioritize network flows with less remaining time slack.
  • Implemented as a pluggable module for vLLM, tested on a 32-GPU cluster, addressing a key bottleneck in scalable, disaggregated LLM serving.

Why It Matters

Enables AI service providers to serve more concurrent users with faster, more reliable response times, improving cost-efficiency and user experience.