GoodServe boosts agentic LLM inference goodput by up to 27% on heterogeneous GPUs

A new system routes each LLM request to the best-suited GPU, raising on-time completions by up to 27%.

Deep Dive

GoodServe tackles the challenge of serving agentic LLM inference, where every request's end-to-end latency is critical, across heterogeneous GPU resources. Traditional routing methods often fail to match a request's demands to the right GPU, causing violations of latency service-level objectives (SLOs). GoodServe introduces a predict-and-rectify framework. In the predict step, it estimates each request's output length (a key latency factor) and tracks real-time GPU load, then uses a just-enough instance selection heuristic to assign the request to the most appropriate GPU. This keeps requests with shorter outputs from being over-provisioned while giving longer ones adequate resources. In the rectify step, GoodServe continuously monitors active requests for SLO-violation risk and can migrate a request to another GPU mid-execution if conditions shift, adapting to unpredictable dynamics such as GPU contention or sudden load spikes.
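The routing step described above can be illustrated with a minimal sketch. It is not GoodServe's actual implementation: the linear latency model, the `Instance` fields, the `cost` ranking, and all numbers are assumptions made for illustration. The idea shown is only "pick the least capable instance that is still predicted to meet the SLO."

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical view of one serving instance (all fields are assumptions)."""
    name: str
    ms_per_token: float    # assumed per-token decode latency on this GPU type
    queue_delay_ms: float  # assumed current queueing delay from load tracking
    cost: float            # assumed relative cost; higher means a faster GPU


def estimate_latency_ms(inst: Instance, predicted_tokens: int) -> float:
    """Toy latency model: queueing delay plus linear decode time."""
    return inst.queue_delay_ms + predicted_tokens * inst.ms_per_token


def select_just_enough(instances, predicted_tokens, slo_ms):
    """Pick the cheapest instance predicted to meet the SLO.

    Falls back to the instance with the lowest predicted latency
    when no instance is predicted to meet the SLO.
    """
    feasible = [
        inst for inst in instances
        if estimate_latency_ms(inst, predicted_tokens) <= slo_ms
    ]
    if feasible:
        return min(feasible, key=lambda inst: inst.cost)
    return min(instances, key=lambda inst: estimate_latency_ms(inst, predicted_tokens))


# Made-up pool: a slower, cheaper instance and a faster, costlier one.
pool = [
    Instance("a100-0", ms_per_token=30.0, queue_delay_ms=200.0, cost=1.0),
    Instance("h100-0", ms_per_token=15.0, queue_delay_ms=500.0, cost=2.0),
]

short = select_just_enough(pool, predicted_tokens=100, slo_ms=5000)
long = select_just_enough(pool, predicted_tokens=250, slo_ms=5000)
print(short.name, long.name)  # prints "a100-0 h100-0"
```

In this toy setup the short request stays on the cheaper instance because it is already predicted to finish in time, which leaves the faster instance free for the longer request that needs it.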

Evaluated on a realistic testbed with mixed GPU types (e.g., A100s, H100s), GoodServe achieves up to a 27.4% improvement in goodput, the fraction of requests completed within their latency SLOs, compared to state-of-the-art routing policies. Its low overhead and practical assumptions (it does not need perfect knowledge of future requests) make it deployable in existing LLM-serving infrastructure. For organizations running agentic AI on diverse GPU pools, GoodServe offers a drop-in optimization that improves resource utilization while keeping requests within their latency SLOs.
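As a small worked example of the goodput metric as defined here (the fraction of requests completed within their latency SLOs), the sketch below computes it from per-request latencies. The function name and the sample latencies are made up for illustration and are not results from the evaluation.

```python
def goodput(latencies_ms, slos_ms):
    """Fraction of requests whose end-to-end latency met its own SLO."""
    if not latencies_ms:
        return 0.0
    met = sum(1 for lat, slo in zip(latencies_ms, slos_ms) if lat <= slo)
    return met / len(latencies_ms)


# Hypothetical end-to-end latencies and per-request SLOs, in milliseconds.
latencies = [900, 1200, 2500, 800]
slos = [1000, 1000, 2000, 1000]

print(goodput(latencies, slos))  # prints 0.5 (two of four requests met their SLO)
```

Unlike raw throughput, this metric gives no credit for a request that finishes late, which is why routing and migration decisions are framed around SLO attainment.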

Key Points
  • Uses a predict-and-rectify framework: predicts request output lengths and tracks real-time GPU load to inform routing.
  • A just-enough instance selection heuristic avoids over-provisioning short requests while still meeting SLOs.
  • Monitors SLO-violation risks and triggers runtime request migrations on heterogeneous GPU pools.
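The monitoring and migration point can be sketched as a simple risk check. This is an illustrative assumption, not the system's actual policy: the `ActiveRequest` fields, the linear projection of remaining decode time, and the fixed `migration_cost_ms` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ActiveRequest:
    """Hypothetical state of a request that is already running."""
    elapsed_ms: float      # time spent so far
    tokens_done: int       # tokens generated so far
    predicted_tokens: int  # predicted total output length
    slo_ms: float          # end-to-end latency SLO


def projected_finish_ms(req: ActiveRequest, ms_per_token: float) -> float:
    """Project total latency assuming a constant per-token speed from now on."""
    remaining = max(req.predicted_tokens - req.tokens_done, 0)
    return req.elapsed_ms + remaining * ms_per_token


def should_migrate(req, current_ms_per_token, target_ms_per_token, migration_cost_ms):
    """Migrate only if staying is projected to miss the SLO and
    moving (including the assumed migration cost) is projected to meet it."""
    if projected_finish_ms(req, current_ms_per_token) <= req.slo_ms:
        return False
    moved = projected_finish_ms(req, target_ms_per_token) + migration_cost_ms
    return moved <= req.slo_ms


req = ActiveRequest(elapsed_ms=2000, tokens_done=50, predicted_tokens=200, slo_ms=6000)

# Contention slowed the current GPU to 30 ms/token; a target GPU offers 15 ms/token.
print(should_migrate(req, 30.0, 15.0, migration_cost_ms=300))  # prints True
```

The second condition matters: moving a request that would miss its SLO either way only adds migration overhead, so this sketch leaves such requests in place.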

Why It Matters

Enables efficient use of diverse GPU resources for latency-critical agentic AI applications, reducing SLO violations.