SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters
Treating entire agent workflows as schedulable units cuts end-to-end task latency 1.64x on a 64-GPU cluster
AI agents—like SWE-bench coding bots and WebArena browser assistants—execute tens to hundreds of chained LLM calls per task. Yet current GPU schedulers treat each call independently, discarding gigabytes of intermediate state and inflating end-to-end latency by 3–8x. In a new paper accepted to HPDC '26, researchers say this request-level abstraction is fundamentally broken for compound AI workloads. Their solution: SAGA, a distributed scheduler that elevates the entire agent workflow to a first-class schedulable unit.
SAGA introduces three key mechanisms: Agent Execution Graphs that capture workflow structure to predict KV cache reuse across tool-call boundaries (achieving within 1.31x of Bélády's optimal offline policy); session-affinity batching with work stealing that co-locates correlated requests while balancing load globally; and Agent Fair Share, a task-completion-time fairness metric with provable bounded-deviation guarantees. On a 64-GPU cluster running real coding and browser agents, SAGA reduced task completion time by 1.64x (geometric mean, p<0.001) versus vLLM v0.15.1 with prefix caching and affinity routing, improved GPU memory utilization by 1.22x, and hit 99.2% of SLOs under multi-tenant interference.
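To make the first mechanism concrete, here is a minimal sketch of how a workflow graph can inform KV cache eviction. This is illustrative only: the class and function names (`AEGNode`, `reuse_score`) and the token-tuple representation are assumptions, not the paper's actual API. The idea it demonstrates is that a scheduler holding the agent's execution graph can count pending downstream calls that share a cached prompt prefix, whereas a request-level scheduler sees zero future value in any entry once the current call returns.

```python
# Hypothetical sketch of an Agent Execution Graph (AEG): a DAG of LLM-call
# nodes, where a cached prefix's retention value is the number of pending
# descendant calls statically known to reuse it. Names are illustrative.
from dataclasses import dataclass, field

@dataclass
class AEGNode:
    name: str
    prefix_tokens: tuple            # prompt prefix this call extends
    children: list = field(default_factory=list)
    done: bool = False

def reuse_score(node: AEGNode, cached_prefix: tuple) -> int:
    """Count not-yet-executed descendants whose prompts extend cached_prefix.

    A request-level scheduler would score every entry 0 as soon as the
    current call finishes; the AEG lets the scheduler keep prefixes the
    workflow is known to revisit after a tool-call boundary.
    """
    score = 0
    stack = list(node.children)
    while stack:
        n = stack.pop()
        if not n.done and n.prefix_tokens[: len(cached_prefix)] == cached_prefix:
            score += 1
        stack.extend(n.children)
    return score

# Example: a coding-agent turn where a planning call is followed (after a
# test-running tool call) by two repair calls that reuse the plan's prefix.
plan = AEGNode("plan", (1, 2, 3))
fix_a = AEGNode("fix_a", (1, 2, 3, 7))
fix_b = AEGNode("fix_b", (1, 2, 3, 8))
plan.children = [fix_a, fix_b]

print(reuse_score(plan, (1, 2, 3)))  # → 2: both repair calls can reuse it
```

An eviction policy ranking cache entries by this score approximates looking into the future the way Bélády's offline policy does, which is presumably how the 1.31x-of-optimal figure becomes reachable.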
The gains come at a measured cost: roughly 30% lower peak throughput than batch-optimal scheduling. That's a deliberate tradeoff for latency-sensitive interactive deployments, which dominate compound AI usage. The authors argue workflow-aware scheduling is essential as agents become mainstream, and SAGA proves it's practical at scale.
- SAGA treats entire agent workflows as atomic schedulable units, eliminating the 3–8x end-to-end latency inflation caused by per-request scheduling
- On 64 GPUs, it reduces task completion time by 1.64x vs vLLM, improves GPU memory utilization 1.22x, and achieves 99.2% SLO attainment
- Achieves near-optimal KV cache reuse (within 1.31x of Bélády's optimal offline policy) via Agent Execution Graphs that predict reuse across tool-call boundaries
Why It Matters
Real-world AI agents can run up to 64% faster on existing GPU clusters with smarter scheduling