Research & Papers

Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

Data-parallel imbalance wasting compute? BalanceRoute fixes it with millisecond-scale routing.

Deep Dive

Large-scale LLM serving suffers from a subtle but costly bottleneck: data-parallel (DP) load imbalance. When model shards are replicated across many DP workers, every decode step waits for the slowest worker. Even small persistent imbalances compound over time, wasting significant compute. The problem is uniquely hard: KV caches make request migration expensive, loads grow during decoding, arrivals are non-stationary, and the router must decide within a sub-100 ms budget across hundreds of requests and tens of workers.

The paper presents BalanceRoute, a practical online routing solution. Its base variant, BR-0, requires no prediction infrastructure: it uses a piecewise-linear F-score that sharply distinguishes safe admissions from those that trigger overflow, and a two-stage decomposition keeps scheduling costs millisecond-scale. BR-H adds a short, constant-length lookahead and a lightweight termination classifier, extending the F-score to a horizon-discounted form. Evaluated on a 144-NPU cluster against vLLM on both a proprietary production trace and the public Azure-2024 trace, BalanceRoute substantially reduces average DP imbalance and improves end-to-end serving throughput, offering a deployable fix for a first-order efficiency problem.
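To make the piecewise-linear idea concrete, here is a minimal sketch of this style of routing, not the paper's exact formulation: the score grows gently while a worker stays under capacity and much more steeply once an admission would overflow it, so a greedy router naturally avoids overflow-triggering placements. All names and constants (`SAFE_SLOPE`, `OVERFLOW_SLOPE`, `capacity`) are illustrative assumptions.

```python
SAFE_SLOPE = 1.0       # cost per unit load below capacity (assumed value)
OVERFLOW_SLOPE = 50.0  # much steeper cost above capacity (assumed value)

def f_score(load: float, capacity: float) -> float:
    """Piecewise-linear cost: shallow slope below capacity, steep slope above."""
    if load <= capacity:
        return SAFE_SLOPE * load
    return SAFE_SLOPE * capacity + OVERFLOW_SLOPE * (load - capacity)

def route(request_load: float, worker_loads: list[float], capacity: float) -> int:
    """Greedily admit the request to the worker whose F-score increases least."""
    def delta(i: int) -> float:
        return (f_score(worker_loads[i] + request_load, capacity)
                - f_score(worker_loads[i], capacity))
    best = min(range(len(worker_loads)), key=delta)
    worker_loads[best] += request_load
    return best
```

The asymmetry does the work: admitting onto a near-full worker costs `OVERFLOW_SLOPE` per unit, so even a lightly loaded alternative wins decisively rather than by a small margin.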

Key Points
  • BalanceRoute uses a piecewise-linear F-score to capture the asymmetry between safe and overflow admissions, preventing costly imbalances.
  • BR-0 requires no prediction; BR-H adds a short lookahead H and a termination classifier to anticipate near-future decode load.
  • Deployed on a 144-NPU cluster, BalanceRoute outperforms vLLM baselines on both proprietary and Azure-2024 traces, cutting DP imbalance and boosting throughput.
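The horizon-discounted form in BR-H can be sketched as follows, under stated assumptions not taken from the paper: a classifier assigns each in-flight request a per-step termination probability, and a worker's score is its expected load summed over the next H steps, treating terminations as independent per step.

```python
def expected_load(requests: list[tuple[float, float]], horizon: int) -> float:
    """Expected load on one worker over the next `horizon` steps.

    Each request is a (load, p_term) pair; it contributes load * (1 - p_term)**t
    at step t, the probability it is still decoding t steps from now
    (independence per step is an assumption for this sketch).
    """
    total = 0.0
    for load, p_term in requests:
        for t in range(horizon):
            total += load * (1.0 - p_term) ** t
    return total

def route_horizon(request: tuple[float, float],
                  workers: list[list[tuple[float, float]]],
                  horizon: int) -> int:
    """Send the new request to the worker with the lowest horizon score after admission."""
    best = min(range(len(workers)),
               key=lambda i: expected_load(workers[i] + [request], horizon))
    workers[best].append(request)
    return best
```

The intuition matches the deep dive: because loads grow during decoding, a worker that looks light now but holds long-lived requests scores worse over the horizon than its instantaneous load suggests.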

Why It Matters

For LLM serving operators, BalanceRoute provides a practical, low-overhead way to reclaim wasted compute from DP load imbalance.