Research & Papers

FATE scheduler cuts LLM workflow latency by 32% with future-state awareness

New CP-SAT planning preserves model residency and prefix reuse across multi-stage DAGs.

Deep Dive

Large language model workflows are increasingly built as heterogeneous multi-stage directed acyclic graphs (DAGs), where each stage may involve different models, hardware, and resource constraints. Current scheduling policies, whether from serving systems or traditional DAG schedulers, optimize only immediate queue state, placement cost, or reuse signals in isolation. As a result, useful execution state such as model residency, parent-output locality, and prefix reuse is fragmented across decisions, which degrades end-to-end latency and makespan.

FATE, by Zirui Huang, Yi-Xiang Hu, Feng Wu, and Xiangyang Li, takes a future-state-aware approach. Instead of solving a monolithic full-DAG problem, it repeatedly plans over the current ready frontier using a CP-SAT solver, scoring each assignment by both its immediate cost and the downstream state it induces. This lets it preserve multiple dimensions of execution state at once rather than trading one for another. Across real-DAG and controlled prefix-reuse benchmarks, FATE achieves a normalized makespan of 0.675 and a normalized P95 latency of 0.677. That is 32.5% and 32.3% better than RoundRobin, and 8.9% and 8.8% better than the strongest non-FATE baseline, respectively. The paper argues that future-state preservation should be a first-class scheduling objective for heterogeneous LLM workflow serving.
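To make the scoring idea concrete, here is a minimal sketch of a single frontier-planning step. It is not the paper's implementation: exhaustive search stands in for the CP-SAT solver, model residency is the only state tracked, and every name and number (LOAD_COST, plan_frontier, the worker and model labels) is a hypothetical chosen for illustration.

```python
from itertools import permutations

LOAD_COST = 5.0  # hypothetical cost of swapping a model onto a worker


def immediate_cost(task, resident_model):
    """Run time plus a load penalty when the worker holds a different model."""
    return task["run"] + (0.0 if resident_model == task["model"] else LOAD_COST)


def future_penalty(residency_after, upcoming_models):
    """Load cost later stages would pay for models no worker still holds."""
    held = set(residency_after.values())
    return sum(LOAD_COST for m in upcoming_models if m not in held)


def plan_frontier(frontier, residency, upcoming_models, weight=1.0):
    """Return (score, {task_id: worker}) with the lowest combined score.

    Exhaustive search stands in for CP-SAT here. Assumes the frontier
    has no more tasks than there are workers (one task per worker).
    """
    best = None
    for workers in permutations(residency, len(frontier)):
        after = dict(residency)  # residency this assignment would leave behind
        now = 0.0
        for task, worker in zip(frontier, workers):
            now += immediate_cost(task, residency[worker])
            after[worker] = task["model"]
        score = now + weight * future_penalty(after, upcoming_models)
        if best is None or score < best[0]:
            best = (score, {t["id"]: w for t, w in zip(frontier, workers)})
    return best


# Both workers cost the same right now, but a later stage needs "llama".
residency = {"gpu0": "llama", "gpu1": "embed"}
frontier = [{"id": "t1", "model": "rerank", "run": 2.0}]
print(plan_frontier(frontier, residency, upcoming_models=["llama"]))
# → (7.0, {'t1': 'gpu1'})
```

With weight=0 the two placements tie on immediate cost and the planner simply takes the first one it sees (gpu0), evicting the model a later stage needs. The future-state term is what breaks the tie toward gpu1.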

Key Points
  • FATE uses a CP-SAT frontier planner to repeatedly score assignments by immediate cost and induced downstream state, not just prefix reuse.
  • Achieves 32.5% lower normalized makespan and 32.3% lower P95 latency than RoundRobin across real-DAG and controlled prefix-reuse benchmarks.
  • Outperforms the strongest non-FATE baseline by 8.9% in makespan and 8.8% in P95 latency, with gains from preserving multiple state dimensions.
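
The RoundRobin percentages follow directly from the normalized scores, assuming the values are normalized so that RoundRobin equals 1.0 (an assumption consistent with the reported numbers, not a detail confirmed here). The helper name below is hypothetical.

```python
def percent_reduction(normalized: float) -> float:
    """Percent improvement over the baseline that defines 1.0 (assumed RoundRobin)."""
    return round((1.0 - normalized) * 100, 1)


makespan_gain = percent_reduction(0.675)  # normalized makespan reported for FATE
p95_gain = percent_reduction(0.677)       # normalized P95 latency reported for FATE
print(makespan_gain, p95_gain)
# → 32.5 32.3
```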

Why It Matters

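If the reported gains carry over to production, a scheduler that plans for downstream state could run complex multi-model LLM pipelines with roughly a third less makespan and tail latency than RoundRobin, which matters for AI services built on multi-stage workflows.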