Research & Papers

Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints

A new batch-level routing method for LLMs improves accuracy by up to 24% over per-query routing while keeping total cost and GPU usage within strict limits.

Deep Dive

A team of researchers including Jelena Markovic-Voronov and Kayhan Behdin has published a paper introducing a novel framework for routing queries to large language models (LLMs). The core problem they address is distributing user queries efficiently across multiple LLMs, such as GPT-4, Claude, or Llama, while adhering to strict budgets and limited GPU resources. Traditional "per-query" routing methods often fail when queries arrive in non-uniform or adversarial batches, leading to cost overruns and performance bottlenecks. Their solution is a "batch-level, resource-aware" routing system that makes optimal assignments for an entire batch of queries at once, ensuring that total cost and per-model capacity limits are never exceeded.
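
At its core, batch-level routing can be read as a small assignment problem: pick one model per query to maximize predicted quality, subject to a batch-wide cost budget and per-model capacity caps. The sketch below illustrates that formulation as an integer program; the quality scores, costs, and capacities are hypothetical, not taken from the paper.

```python
# Illustrative batch-level routing as an integer program (not the paper's code).
# q[i, j]: predicted quality of answering query i with model j (hypothetical).
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

rng = np.random.default_rng(0)
n_queries, n_models = 8, 3
q = rng.uniform(0.5, 1.0, size=(n_queries, n_models))
cost = np.array([1.0, 0.3, 0.1])    # per-query cost of each model (hypothetical)
capacity = np.array([3, 4, 8])      # max queries each model can take this batch
budget = 4.0                        # total spend allowed for the batch

n_vars = n_queries * n_models       # one binary x[i, j] per (query, model) pair

# Each query goes to exactly one model: sum_j x[i, j] == 1.
assign = np.zeros((n_queries, n_vars))
for i in range(n_queries):
    assign[i, i * n_models:(i + 1) * n_models] = 1.0

# Batch cost stays within budget: sum_{i, j} cost[j] * x[i, j] <= budget.
cost_row = np.tile(cost, n_queries).reshape(1, n_vars)

# Each model serves at most capacity[j] queries: sum_i x[i, j] <= capacity[j].
cap = np.zeros((n_models, n_vars))
for j in range(n_models):
    cap[j, j::n_models] = 1.0

res = milp(
    c=-q.ravel(),                   # milp minimizes, so negate quality to maximize
    constraints=[
        LinearConstraint(assign, 1, 1),
        LinearConstraint(cost_row, 0, budget),
        LinearConstraint(cap, 0, capacity),
    ],
    integrality=np.ones(n_vars),    # all decision variables are integers in {0, 1}
    bounds=Bounds(0, 1),
)
routing = res.x.reshape(n_queries, n_models).argmax(axis=1)
print("query -> model:", routing)
```

Because the whole batch is solved jointly, an adversarial mix of expensive queries cannot silently blow the budget the way independent per-query decisions can.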

The paper introduces two key innovations: a robust routing variant that accounts for uncertainty in predicted LLM performance, and an offline procedure for allocating model instances to balance quality and throughput. In experiments on multi-task LLM benchmarks, the robust variant improved accuracy by 1-14% over non-robust methods, depending on the performance estimator used. More significantly, the batch-level approach outperformed standard per-query routing by up to 24% under adversarial batching conditions, and the optimized instance allocation added a further gain of up to 3%. All of these improvements were achieved while strictly respecting predefined cost and GPU resource constraints, a critical requirement for production deployments.
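
A standard way to build robustness to noisy quality predictions, and one plausible reading of the paper's robust variant, is to route on a pessimistic lower confidence bound rather than the point estimate. In the sketch below, q_mean, q_std, and the penalty weight lam are all hypothetical names, not the paper's notation:

```python
# Pessimistic scoring under uncertain quality predictions (illustrative only).
import numpy as np

def robust_scores(q_mean: np.ndarray, q_std: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Lower-confidence-bound score per (query, model) pair.

    q_mean[i, j]: predicted quality of model j on query i.
    q_std[i, j]:  estimated uncertainty of that prediction.
    lam:          how heavily uncertainty is penalized (hypothetical knob).
    """
    return q_mean - lam * q_std
```

These pessimistic scores would replace q in the batch assignment above, steering the router away from model/query pairs whose predicted quality is high but unreliable.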

Key Points
  • Batch-level routing outperforms per-query methods by up to 24% under adversarial query patterns.
  • Robust variant improves accuracy by 1-14% by accounting for uncertainty in LLM performance predictions.
  • Optimized offline instance allocation yields an additional gain of up to 3% while respecting cost/GPU limits (see the sketch after this list).
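
The allocation step is described only at a high level, but a simple greedy heuristic conveys the flavor: under a fixed GPU budget, repeatedly add whichever model instance buys the most estimated quality-throughput per GPU. Everything below, including the diminishing-returns factor and all numbers, is an assumption for illustration, not the paper's procedure.

```python
# Greedy offline instance allocation under a GPU budget (illustrative only).
import numpy as np

def allocate_instances(value, gpus, thru, gpu_budget):
    """value[j]: estimated per-query quality of model j (hypothetical).
    gpus[j]:  GPUs one instance of model j consumes.
    thru[j]:  queries one instance of model j can serve per batch.
    """
    n = np.zeros(len(value), dtype=int)  # instances allocated to each model
    used = 0
    while True:
        # Marginal quality-throughput per GPU, with an assumed diminishing-returns
        # factor so capacity is spread across models rather than piled on one.
        gain = value * thru / gpus / (1 + n)
        gain[used + gpus > gpu_budget] = -np.inf  # instances that no longer fit
        j = int(np.argmax(gain))
        if gain[j] == -np.inf:
            return n  # nothing else fits in the remaining GPU budget
        n[j] += 1
        used += gpus[j]

counts = allocate_instances(
    value=np.array([0.9, 0.7, 0.5]),  # hypothetical model quality
    gpus=np.array([8, 4, 1]),         # GPUs per instance
    thru=np.array([10, 20, 40]),      # batch throughput per instance
    gpu_budget=16,
)
print("instances per model:", counts)  # per-model capacity = counts * thru
```

In a full system, the resulting counts times thru would feed back into the capacity vector used by the batch router.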

Why It Matters

This enables companies to run multi-LLM systems more cost-effectively and reliably, maximizing performance within strict operational budgets.