Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints
A new batch-level routing method for LLMs improves accuracy by up to 24% while strictly controlling cost and GPU usage.
A team of researchers, including Jelena Markovic-Voronov and Kayhan Behdin, has published a paper introducing a novel framework for routing queries to large language models (LLMs). The core problem they address is efficiently distributing user queries across multiple LLMs—such as GPT-4, Claude, or Llama—while adhering to strict budgets and limited GPU resources. Traditional "per-query" routing methods often fail when queries arrive in non-uniform or adversarial batches, leading to cost overruns and performance bottlenecks. Their solution is a "batch-level, resource-aware" routing system that makes optimal assignments for an entire batch of queries at once, ensuring total cost and model capacity limits are never exceeded.
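To make the batch-level idea concrete, here is a minimal sketch of routing a whole batch under a shared cost budget and per-model capacity limits. This is an illustrative greedy heuristic, not the paper's actual optimization procedure (which solves the assignment jointly); all names (`route_batch`, the quality/cost inputs) are hypothetical.

```python
def route_batch(quality, cost, budget, capacity):
    """Assign each query in a batch to one model, aiming to maximize total
    predicted quality while never exceeding the batch cost budget or any
    model's capacity.

    quality[i][m] -- predicted quality of model m on query i (assumed given
                     by some upstream performance estimator)
    cost[m]      -- per-query cost of model m
    capacity[m]  -- max queries model m can serve in this batch
    """
    n, num_models = len(quality), len(cost)
    remaining_cap = list(capacity)
    remaining_budget = budget
    assignment = [None] * n

    # Heuristic ordering: route high-spread queries first, since they lose
    # the most quality if they get pushed onto a fallback model.
    order = sorted(range(n),
                   key=lambda i: max(quality[i]) - min(quality[i]),
                   reverse=True)
    for i in order:
        # Only models with spare capacity and affordable cost are feasible.
        feasible = [m for m in range(num_models)
                    if remaining_cap[m] > 0 and cost[m] <= remaining_budget]
        if not feasible:
            raise ValueError("batch infeasible under budget/capacity limits")
        best = max(feasible, key=lambda m: quality[i][m])
        assignment[i] = best
        remaining_cap[best] -= 1
        remaining_budget -= cost[best]
    return assignment
```

For example, with a strong-but-expensive model 0 and a cheap model 1, only the query that benefits most from the strong model gets it once budget and capacity run low—per-query routing, by contrast, would greedily send every query to model 0 and blow the budget.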
The paper introduces two key innovations: a robust variant that accounts for uncertainty in predicted LLM performance, and an offline procedure for allocating query instances to balance quality and throughput. In experiments on multi-task LLM benchmarks, the robust routing improved accuracy by 1-14% over non-robust methods, depending on the performance estimator used. More significantly, the batch-level approach outperformed standard per-query routing by up to 24% under adversarial batching conditions. An optimized instance allocation strategy provided an additional performance gain of up to 3%. All these improvements were achieved while maintaining strict adherence to predefined cost and GPU resource constraints, a critical requirement for production deployments.
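The robust variant can be pictured as a lower-confidence-bound adjustment: instead of trusting the estimator's point prediction, the router penalizes each predicted quality by a multiple of its uncertainty, so models whose performance is overestimated look less attractive. This sketch is an assumption about the general technique, not the paper's exact formulation; `robust_quality` and the `kappa` knob are illustrative names.

```python
def robust_quality(pred, stderr, kappa=1.0):
    """Penalize each predicted quality pred[i][m] by kappa times its
    estimated uncertainty stderr[i][m], yielding a conservative
    (lower-confidence-bound) score the router can optimize instead.
    kappa controls how pessimistic the routing is."""
    return [[p - kappa * s for p, s in zip(row_p, row_s)]
            for row_p, row_s in zip(pred, stderr)]
```

Routing on these conservative scores rather than raw predictions is what allows the robust variant to avoid assignments that only look good because the estimator is noisy.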
- Batch-level routing outperforms per-query methods by up to 24% under adversarial query patterns.
- Robust variant improves accuracy by 1-14% by accounting for uncertainty in LLM performance predictions.
- Optimized offline instance allocation yields an additional performance gain of up to 3% while respecting cost/GPU limits.
Why It Matters
This enables companies to run multi-LLM systems more cost-effectively and reliably, maximizing performance within strict operational budgets.