FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving
New system decouples preemption from scheduling to slash latency and maximize server goodput.
Researchers from Tsinghua University and Microsoft propose FlowPrefill, a new LLM serving system. It introduces Operator-Level Preemption and Event-Driven Scheduling to decouple preemption granularity from scheduling frequency, addressing the head-of-line blocking that arises during the compute-intensive prefill phase. Evaluations show it improves maximum goodput by up to 5.6x over state-of-the-art systems while better meeting time-to-first-token (TTFT) service level objectives (SLOs).
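To make the idea concrete, here is a minimal toy sketch (not FlowPrefill's actual implementation; all names are hypothetical) of operator-level preemption: a long prefill executes one operator at a time and, between operators, checks for newly arrived shorter requests, yielding to them so their TTFT is not blocked behind the long request.

```python
# Toy illustration of operator-level preemption during prefill.
# A request's prefill is a sequence of operators; between operators the
# scheduler may preempt in favor of a shorter newly arrived request.
from collections import deque


class Request:
    def __init__(self, name, num_ops):
        self.name = name
        self.num_ops = num_ops   # total operators in this request's prefill
        self.done_ops = 0


def run_prefill(active, arrivals, log):
    """Execute `active`'s prefill operator by operator; after each operator,
    preempt if a pending request has fewer remaining operators."""
    while active.done_ops < active.num_ops:
        active.done_ops += 1                      # run one operator
        log.append((active.name, active.done_ops))
        remaining = active.num_ops - active.done_ops
        if arrivals and arrivals[0].num_ops < remaining:
            newcomer = arrivals.popleft()
            run_prefill(newcomer, arrivals, log)  # serve the short request first


log = []
long_req = Request("long", 6)            # long prefill: 6 operators
arrivals = deque([Request("short", 2)])  # short request waiting in the queue
run_prefill(long_req, arrivals, log)
# The short request completes after just one operator of the long prefill,
# instead of waiting for all 6 (head-of-line blocking).
```

With request-level preemption only, the short request would wait for the entire 6-operator prefill; checking between operators bounds its queueing delay to a single operator's execution time.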
Why It Matters
Enables AI providers to serve more users faster with the same hardware, reducing costs and improving responsiveness.