inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference
A new open-source tool uses queueing theory to cut LLM inference serving costs and catch under-provisioned fleets before they miss their latency SLOs.
A team of researchers from Carnegie Mellon University and McGill University has released inference-fleet-sim, a novel open-source tool designed to solve a critical and expensive problem in AI infrastructure: correctly sizing a GPU fleet for large language model (LLM) inference. The tool addresses the complex, non-linear relationship between workload patterns (like heavy-tailed token-length distributions), routing policies, and hardware performance. Unlike existing tools that optimize a fixed fleet, inference-fleet-sim answers the upstream business question: how many GPUs to buy, which types (A10G, A100, H100), and how to arrange them in monolithic, two-pool, or disaggregated topologies.
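To make that upstream question concrete, here is a minimal sketch, in Python, of how such a sizing search can be framed: enumerate GPU type, count, and topology, and keep the cheapest combination that a simulator or queueing model certifies against the SLO. This is not the tool's actual API; the hourly prices and the `meets_slo` evaluator are hypothetical placeholders.

```python
# Minimal sketch (not the tool's actual API) of the upstream sizing search:
# enumerate GPU type, count, and topology, keep the cheapest combination that
# passes the SLO check. Prices and meets_slo() are hypothetical placeholders.
from itertools import product

GPU_HOURLY_USD = {"A10G": 1.0, "A100": 3.1, "H100": 6.5}   # assumed list prices
TOPOLOGIES = ("monolithic", "two-pool", "disaggregated")

def cheapest_fleet(meets_slo, max_gpus=64):
    """meets_slo(gpu, count, topology) -> bool, e.g. backed by a simulator run."""
    best = None
    for gpu, topo, n in product(GPU_HOURLY_USD, TOPOLOGIES, range(1, max_gpus + 1)):
        cost = n * GPU_HOURLY_USD[gpu]
        if (best is None or cost < best[0]) and meets_slo(gpu, n, topo):
            best = (cost, gpu, n, topo)
    return best   # (hourly cost, GPU type, count, topology) or None
```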
The core innovation is its hybrid approach, combining analytical M/G/c queueing models with discrete-event simulation (DES) and a physics-informed GPU performance model. This allows it to find the minimum-cost fleet configuration that empirically meets strict service-level objectives, like a P99 Time-To-First-Token (TTFT) target. The researchers validated the tool on seven scenarios built from public LMSYS and Azure traces plus a synthetic agent-heavy workload. In each case, the simulation surfaced answers that simple analysis would have gotten wrong, such as the correct split threshold for a two-pool fleet or whether an apparently idle fleet was actually at risk of breaking under load.
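The analytical half of such a hybrid can be illustrated with a standard M/G/c approximation: the Erlang-C waiting probability for an M/M/c queue, scaled by (1 + C_s^2)/2 to account for general, possibly heavy-tailed service times. The sketch below uses this common textbook approximation (Allen-Cunneen style), not necessarily the paper's exact formulation, and the example numbers are illustrative.

```python
# Sketch of an analytical M/G/c estimate: Erlang-C wait for M/M/c, scaled by
# (1 + SCV)/2 for general service times (Allen-Cunneen-style approximation).
# Not the paper's exact formulation; the example numbers are illustrative.
import math

def mgc_mean_wait(arrival_rate, service_rate, servers, service_scv):
    """Approximate mean queueing delay (seconds) for an M/G/c queue.

    arrival_rate : Poisson request rate (req/s)
    service_rate : per-server service rate (req/s)
    servers      : number of parallel servers (GPU replicas)
    service_scv  : squared coefficient of variation of service time
                   (large for heavy-tailed token-length distributions)
    """
    a = arrival_rate / service_rate                 # offered load (Erlangs)
    rho = a / servers
    if rho >= 1.0:
        return math.inf                             # unstable: backlog grows forever
    # Erlang-C: probability an arriving request has to queue
    top = a**servers / (math.factorial(servers) * (1 - rho))
    p_wait = top / (sum(a**k / math.factorial(k) for k in range(servers)) + top)
    wq_mmc = p_wait / (servers * service_rate * (1 - rho))
    return (1 + service_scv) / 2 * wq_mmc           # correction for M/G/c

# e.g. 40 req/s across 8 replicas, each serving 6 req/s, heavy-tailed service (SCV = 4)
print(mgc_mean_wait(40.0, 6.0, 8, 4.0))
```

A closed-form estimate like this only gives mean waits; replaying trace arrivals and token lengths in a DES is what lets the tool check tail metrics such as P99 TTFT and catch effects the formula misses.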
For AI platform engineers and cloud architects, the tool provides a data-driven alternative to guesswork and rules of thumb. It enables cost-effective provisioning by preventing both wasteful over-spending on excess GPUs and the severe business risk of under-provisioning that leads to missed SLOs and broken services. By simulating the joint effects of queueing, routing, and hardware without access to the physical machines, teams can plan robust, efficient inference infrastructure before committing to multi-million-dollar procurement decisions.
- Uses hybrid M/G/c queueing theory & discrete-event simulation to model complex LLM inference dynamics.
- Tests seven scenarios built from LMSYS/Azure traces plus a synthetic agent-heavy workload, finding optimal GPU configs that simple analysis misses.
- Includes physics-informed performance models for A10G, A100, and H100 GPUs across multiple fleet topologies (a rough sketch of this style of model follows below).
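For flavor, here is a rough roofline-style sketch of what a physics-informed GPU latency model can look like: prefill treated as compute-bound and decode as memory-bandwidth-bound. The spec-sheet numbers, utilization factors, and default model size are approximate, illustrative assumptions, not the tool's calibrated parameters.

```python
# Rough roofline-style sketch of a physics-informed GPU latency model:
# prefill treated as compute-bound, decode as memory-bandwidth-bound.
# Spec numbers are approximate public figures; MFU/MBU and the default model
# size are illustrative assumptions, not the tool's calibrated parameters.
GPU_SPECS = {
    #        (dense FP16 TFLOPS, HBM GB/s), approximate
    "A10G": (70.0,   600.0),
    "A100": (312.0, 2039.0),
    "H100": (989.0, 3350.0),
}

def estimate_latency(gpu, prompt_tokens, model_params_b=7.0, mfu=0.4, mbu=0.6):
    """Return (TTFT seconds, seconds per output token) for a single request."""
    tflops, gbps = GPU_SPECS[gpu]
    # Prefill: ~2 * params FLOPs per prompt token, at an assumed utilization (MFU)
    prefill_flops = 2.0 * model_params_b * 1e9 * prompt_tokens
    ttft = prefill_flops / (tflops * 1e12 * mfu)
    # Decode: every generated token streams the FP16 weights from HBM once (MBU)
    weight_bytes = model_params_b * 1e9 * 2.0
    per_token = weight_bytes / (gbps * 1e9 * mbu)
    return ttft, per_token

# e.g. a 7B model with a 1,000-token prompt on each GPU class
for g in GPU_SPECS:
    print(g, estimate_latency(g, 1000))
```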
Why It Matters
Prevents costly over-provisioning and service failures for companies running LLMs at scale, directly impacting infrastructure budgets and reliability.