FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism
New research solves the 'cost cliff' where long prompts consume 8-42x more GPU capacity than short ones.
A team from Carnegie Mellon University and the University of Chicago has published a paper on FleetOpt, a novel framework designed to drastically reduce the cost of running large language model (LLM) inference at scale. The core problem FleetOpt tackles is the inefficient provisioning of GPU fleets for worst-case, ultra-long context lengths, which leaves expensive GPU memory (reserved for the KV cache) idle for the vast majority of shorter requests. FleetOpt's analytical model proves that the most cost-efficient architecture is a two-pool system: one for short-context requests and another for long-context requests, separated by an optimal context-length boundary.
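To make the two-pool argument concrete, here is a minimal sketch of how such a planner might search for the boundary, assuming a toy model in which every in-flight request reserves KV cache for its pool's maximum context. The constants, function names, and cost formula below are illustrative assumptions, not the paper's actual model.

```python
import math

# Hypothetical constants (not from the paper): per-GPU KV-cache token
# capacity, and the worst-case context length the long pool must support.
GPU_KV_BUDGET = 2_000_000
MAX_CONTEXT = 128_000

def fleet_size(concurrency, pool_max_ctx):
    """GPUs needed when every in-flight request reserves KV cache for the
    pool's maximum context: the worst-case reservation that leaves memory
    idle on short requests."""
    return math.ceil(concurrency * pool_max_ctx / GPU_KV_BUDGET)

def plan_two_pool(lengths, concurrency):
    """Sweep candidate boundaries and return the one minimizing total GPUs."""
    best_boundary, best_gpus = None, float("inf")
    for boundary in sorted(set(lengths)):
        short_frac = sum(1 for n in lengths if n <= boundary) / len(lengths)
        n_short = round(concurrency * short_frac)
        gpus = (fleet_size(n_short, boundary)
                + fleet_size(concurrency - n_short, MAX_CONTEXT))
        if gpus < best_gpus:
            best_boundary, best_gpus = boundary, gpus
    return best_boundary, best_gpus

# Toy workload: mostly short prompts with a long tail, the shape that makes
# a homogeneous fleet provisioned at MAX_CONTEXT so wasteful.
lengths = [2_000] * 900 + [30_000] * 80 + [120_000] * 20
boundary, two_pool_gpus = plan_two_pool(lengths, concurrency=512)
print(boundary, two_pool_gpus, fleet_size(512, MAX_CONTEXT))
# On this toy trace, the two-pool plan needs 5 GPUs vs. 33 homogeneous.
```

Note that the sweep only visits distinct observed lengths; with a precomputed length histogram the whole plan is a single pass, which is consistent with the paper's sub-millisecond planning claim.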
The fundamental barrier to this efficiency is the 'cost cliff': a request just above the optimal boundary can consume 8 to 42 times more GPU capacity than one just below it. FleetOpt's breakthrough is its Compress-and-Route (C&R) implementation mechanism. C&R applies extractive compression at the gateway layer to trim 'borderline' long prompts so they fit into the cheaper short-context pool, effectively smoothing out the cost cliff. The unified FleetOpt planner calculates the optimal fleet configuration, including the number of GPUs in each pool and the compression boundary, in under 1 millisecond.
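The article doesn't spell out the routing rule, but the gateway decision it describes might look like the following sketch. `BOUNDARY`, `COMPRESS_RATIO`, and `extractive_compress` are all hypothetical placeholders standing in for the planner's output and whatever extractive method the paper actually uses.

```python
# Hypothetical values standing in for planner output; not from the paper.
BOUNDARY = 8_000        # short-pool max context chosen by the planner
COMPRESS_RATIO = 0.5    # assumed fraction of tokens kept by compression

def extractive_compress(prompt_tokens, target_len):
    """Placeholder for an extractive compressor: a real one would select
    the most salient spans, while this sketch simply truncates."""
    return prompt_tokens[:target_len]

def route(prompt_tokens):
    """Gateway decision: short pool, compress-then-short pool, or long pool."""
    n = len(prompt_tokens)
    if n <= BOUNDARY:
        # Already fits the cheap short-context pool.
        return "short-pool", prompt_tokens
    if n * COMPRESS_RATIO <= BOUNDARY:
        # Borderline request: compress it under the boundary instead of
        # paying the 8-42x long-pool cost cliff.
        return "short-pool", extractive_compress(prompt_tokens, BOUNDARY)
    # Genuinely long request: the long pool is still the right home.
    return "long-pool", prompt_tokens

pool, tokens = route(list(range(12_000)))   # borderline: 12k * 0.5 <= 8k
print(pool, len(tokens))                    # short-pool 8000
```

The design point is that only requests in the band just above the boundary get compressed; genuinely long prompts still go to the long pool, confining any compression loss to the borderline cases where the cost cliff bites hardest.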
The analytical model was validated against a discrete-event simulator, matching it to within 3% error, and the framework was evaluated on three production traces. The results show total GPU cost reductions of 6% to 82% compared to a standard homogeneous fleet. The C&R mechanism alone contributed an additional 1 to 44 percentage points of savings beyond simple two-pool routing, depending on the workload. This represents a major step toward more economical and sustainable large-scale AI deployment.
- Solves the 'cost cliff' where long-context requests consume 8-42x more GPU capacity than short ones.
- Uses gateway-layer Compress-and-Route (C&R) to trim prompts and route them to cheaper GPU pools.
- Reduces total GPU inference costs by 6-82% on production traces, with C&R alone adding up to 44 percentage points of savings.
Why It Matters
This could dramatically lower the operating costs for companies running LLM APIs like OpenAI's GPT-4 or Anthropic's Claude, making AI more accessible.