Lodestar cuts LLM time-to-first-token by 1.41x with online learning
Self-adapting router learns optimal GPU assignments in 5 minutes, beating heuristics by up to 4.38x.
Routing LLM inference requests efficiently across GPU clusters is notoriously difficult: execution is input-dependent, requests are batched, KV-cache reuse varies from request to request, and latency responds nonlinearly. Traditional load balancing and hand-tuned heuristics fall short. Lodestar, a new learning-based system from researchers at a major cloud provider (co-authors include Brighten Godfrey, among others), addresses this by treating routing as a reinforcement learning problem. It continuously collects per-request snapshots (real-time instance state, request context length, and observed time-to-first-token, or TTFT) and trains an online reward predictor, which it uses to pick the GPU instance that minimizes TTFT or maximizes a custom reward.
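To make the idea concrete, here is a minimal sketch of an online reward predictor used for routing. It is not Lodestar's implementation: the feature set, the per-instance linear model, the epsilon-greedy exploration, and every name (`Snapshot`, `OnlineRouter`, `queue_depth`, `cache_hit`) are assumptions chosen for illustration. The reward is simply the negative of the observed TTFT.

```python
import random
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One per-request observation of an instance (illustrative fields)."""
    queue_depth: float   # normalized number of requests waiting
    context_len: float   # normalized prompt length
    cache_hit: float     # fraction of the prompt already in the KV cache

    def features(self) -> list[float]:
        # Leading 1.0 is the bias term of the linear model.
        return [1.0, self.queue_depth, self.context_len, self.cache_hit]


class OnlineRouter:
    """Hypothetical router: one linear reward model per GPU instance."""

    def __init__(self, n_instances: int, lr: float = 0.05,
                 epsilon: float = 0.1, seed: int = 0) -> None:
        self.n_instances = n_instances
        self.lr = lr              # SGD step size
        self.epsilon = epsilon    # probability of exploring a random instance
        self.rng = random.Random(seed)
        self.weights = [[0.0] * 4 for _ in range(n_instances)]

    def predict(self, i: int, snap: Snapshot) -> float:
        """Predicted reward (negative TTFT) for sending the request to i."""
        return sum(w * x for w, x in zip(self.weights[i], snap.features()))

    def choose(self, snaps: list[Snapshot]) -> int:
        """Pick an instance: explore occasionally, otherwise go greedy."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_instances)
        return max(range(self.n_instances),
                   key=lambda i: self.predict(i, snaps[i]))

    def update(self, i: int, snap: Snapshot, ttft: float) -> None:
        """One SGD step on squared error between predicted and observed reward."""
        err = -ttft - self.predict(i, snap)
        self.weights[i] = [w + self.lr * err * x
                           for w, x in zip(self.weights[i], snap.features())]
```

In use, the serving layer would call `choose` with one current snapshot per instance, forward the request, and call `update` once the first token arrives. A production system would likely use a richer model and a more careful exploration strategy than this sketch.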
Tested on public cloud GPU clusters, Lodestar achieves 1.41x lower average TTFT and 1.47x lower P99 TTFT than a leading prefix-cache-aware and load-aware heuristic, with even larger gains (up to 4.38x/4.42x) on heterogeneous hardware. It learns its routing policy within about five minutes and is cloud-native, working with existing serving stacks like vLLM. For professionals running LLM services, this means faster responses, better GPU utilization, and no manual tuning, along with automatic adaptation to workload spikes and infrastructure changes.
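For readers who think in percentages, the "Nx lower" figures can be converted into a fractional latency reduction. The snippet below is a small worked calculation; it assumes that "1.41x lower" means baseline TTFT divided by Lodestar TTFT equals 1.41, and that the 4.38x/4.42x pair refers to average and P99 TTFT respectively. The helper name `latency_reduction` is illustrative.

```python
def latency_reduction(speedup: float) -> float:
    """Fraction of baseline latency removed when baseline / new == speedup."""
    return 1.0 - 1.0 / speedup


reported = [
    ("average TTFT", 1.41),
    ("P99 TTFT", 1.47),
    ("average TTFT, heterogeneous (assumed)", 4.38),
    ("P99 TTFT, heterogeneous (assumed)", 4.42),
]

for label, speedup in reported:
    print(f"{label}: {latency_reduction(speedup):.1%} lower")
# average TTFT: 29.1% lower
# P99 TTFT: 32.0% lower
# average TTFT, heterogeneous (assumed): 77.2% lower
# P99 TTFT, heterogeneous (assumed): 77.4% lower
```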
- Lodestar uses online reinforcement learning to route inference requests, adapting to changing workloads and heterogeneous GPUs in under 5 minutes.
- Reduces average TTFT by 1.41x and P99 TTFT by 1.47x vs. state-of-the-art heuristics; up to 4.38x on heterogeneous clusters.
- Cloud-native, integrates seamlessly with vLLM and other serving stacks without code changes.
Why It Matters
Lodestar automates LLM inference routing to cut latency and boost GPU utilization, crucial for cost and user experience at scale.