HexAGenT scheduler cuts the SLO scale agentic LLM workflows need by 20-33%
New workflow-aware scheduler reduces the SLO scale required for timely workflow completion by up to 80.5% on heterogeneous GPU clusters
Agentic LLM applications execute user requests as multi-step workflows that combine planning, tool use, branching, and synthesis. Users experience the end-to-end latency of the whole workflow, not the latency of any individual LLM call. Scheduling these workflows efficiently across heterogeneous prefill-decode disaggregated clusters is challenging for three reasons: dependencies are revealed incrementally at runtime, calls vary in prompt length and KV-cache needs, and the prefill and decode stages impose different compute and memory constraints.
HexAGenT, introduced in a paper on arXiv (2605.16637), addresses this by modeling each request as an online-revealed DAG (directed acyclic graph). It maintains a running estimate of the workflow's standalone completion horizon and prioritizes ready calls based on projected risk of missing that horizon. The scheduler jointly selects prefill placement, decode placement, and local queue priority while accounting for KV-cache capacity and cross-stage transfer latency across heterogeneous GPUs (A100, H100, H200).
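The ideas in the paragraph above can be illustrated with a small sketch. It is not the paper's algorithm: the slack formula, the `Call` and `Workflow` classes, the `pick_next` and `place` helpers, and all time and memory estimates are hypothetical stand-ins chosen for illustration. The sketch shows a workflow whose calls are revealed at runtime, a priority rule that favors the ready call whose workflow is closest to missing its estimated horizon, and a placement step that picks a prefill and decode instance pair subject to KV-cache capacity and transfer latency.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    name: str
    est_seconds: float                      # assumed estimate of prefill + decode time
    deps: set = field(default_factory=set)  # names of calls that must finish first


@dataclass
class Workflow:
    arrival: float
    horizon: float                          # running estimate of standalone completion time
    calls: dict = field(default_factory=dict)
    done: set = field(default_factory=set)

    def reveal(self, call, extra_horizon=0.0):
        """Add a call discovered at runtime and update the horizon estimate."""
        self.calls[call.name] = call
        self.horizon += extra_horizon

    def ready(self):
        """Calls that are not finished and whose dependencies are all finished."""
        return [c for c in self.calls.values()
                if c.name not in self.done and c.deps <= self.done]

    def remaining(self):
        """Crude remaining-work estimate: sum over known unfinished calls."""
        return sum(c.est_seconds for c in self.calls.values()
                   if c.name not in self.done)


def slack(wf, now):
    """Time left before the projected horizon once known work is subtracted."""
    return (wf.arrival + wf.horizon) - now - wf.remaining()


def pick_next(workflows, now):
    """Return (workflow_id, call) for the ready call with the least slack."""
    candidates = [(slack(wf, now), wf_id, call.name)
                  for wf_id, wf in workflows.items()
                  for call in wf.ready()]
    if not candidates:
        return None
    _, wf_id, name = min(candidates)
    return wf_id, workflows[wf_id].calls[name]


def place(kv_gb, prefill_pool, decode_pool, transfer_s):
    """Pick the (prefill, decode) pair with the lowest estimated total time.

    prefill_pool: instance name -> estimated prefill seconds
    decode_pool:  instance name -> (estimated decode seconds, free KV-cache GB)
    transfer_s:   (prefill name, decode name) -> KV-cache transfer seconds
    """
    best = None
    for p, prefill_s in prefill_pool.items():
        for d, (decode_s, free_kv) in decode_pool.items():
            if free_kv < kv_gb:
                continue  # this decode instance cannot hold the call's KV cache
            total = prefill_s + transfer_s[(p, d)] + decode_s
            if best is None or total < best[0]:
                best = (total, p, d)
    return best


# Made-up example: workflow "b" has a tighter horizon, so its call runs first.
workflows = {
    "a": Workflow(arrival=0.0, horizon=10.0),
    "b": Workflow(arrival=0.0, horizon=4.0),
}
workflows["a"].reveal(Call("plan", est_seconds=2.0))
workflows["b"].reveal(Call("plan", est_seconds=3.0))
first = pick_next(workflows, now=1.0)

# After "plan" finishes, workflow "b" reveals a dependent tool call.
workflows["b"].done.add("plan")
workflows["b"].reveal(Call("tool", est_seconds=1.0, deps={"plan"}), extra_horizon=2.0)
second = pick_next(workflows, now=4.0)

# Illustrative instance names and timings, not measurements.
prefill_pool = {"h100-prefill": 0.2, "a100-prefill": 0.4}
decode_pool = {"h200-decode": (1.0, 8.0), "a100-decode": (1.5, 40.0)}
transfer_s = {(p, d): 0.1 for p in prefill_pool for d in decode_pool}
placement = place(10.0, prefill_pool, decode_pool, transfer_s)
```

In this toy setup the faster decode instance is skipped because it lacks free KV-cache space for a 10 GB call, which is the kind of interaction between placement and capacity that a joint decision has to weigh. A real scheduler would also need to model queueing delay on each instance, which this sketch omits.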
In experiments with representative agentic workloads, HexAGenT significantly outperforms baseline schedulers. It reduces the SLO (service-level objective) scale required for timely workflow completion by an average of 20.1% at 95% attainment and 33.0% at 99% attainment, with maximum reductions of 45.0% and 80.5%, respectively. The gains are attributed to accounting for both the heterogeneity of GPU types and the dynamic, incrementally revealed structure of agentic workflows.
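To make the metric concrete, here is a minimal sketch of how a required SLO scale could be computed. It assumes the common convention that a workflow's deadline is its standalone completion time multiplied by a scale factor, and that attainment is the fraction of workflows meeting that deadline. The article does not spell out the definition, so treat this as an assumption; the function names and the sample numbers are hypothetical.

```python
import math


def required_slo_scale(latencies, standalone, attainment):
    """Smallest scale s such that at least `attainment` of workflows
    finish within s * (their standalone completion time)."""
    ratios = sorted(lat / base for lat, base in zip(latencies, standalone))
    # Number of workflows that must meet the deadline (at least one).
    k = max(math.ceil(attainment * len(ratios)), 1)
    return ratios[k - 1]


def relative_reduction(baseline_scale, new_scale):
    """Fractional reduction in required SLO scale versus a baseline."""
    return (baseline_scale - new_scale) / baseline_scale


# Made-up latencies (seconds) for four workflows with a 10 s standalone time.
standalone = [10.0, 10.0, 10.0, 10.0]
baseline_latencies = [10.0, 12.0, 20.0, 30.0]
improved_latencies = [10.0, 12.0, 15.0, 30.0]

baseline_scale = required_slo_scale(baseline_latencies, standalone, 0.75)
improved_scale = required_slo_scale(improved_latencies, standalone, 0.75)
reduction = relative_reduction(baseline_scale, improved_scale)
```

Under this reading, a lower required scale means the same fraction of workflows completes on time with tighter deadlines. It also explains why the reported reductions are larger at 99% attainment than at 95%: the higher target is governed by the slowest workflows, where scheduling decisions matter most.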
The work is particularly relevant as agentic AI systems move into production, where multi-step reasoning and tool use are common. By making scheduling both workflow-aware and heterogeneity-aware, HexAGenT enables lower latency and more reliable responses without requiring hardware upgrades, a practical advance for deploying sophisticated LLM agents at scale.
- Models agentic workflows as online-revealed DAGs to account for runtime dependencies
- Reduces the required SLO scale by an average of 20.1% at 95% attainment and 33.0% at 99% attainment on heterogeneous clusters
- Supports varying GPU types (A100/H100/H200) with KV-cache and cross-stage transfer constraints
Why It Matters
Enables faster, more reliable agentic AI systems by optimizing multi-step LLM workflows in production.