Research & Papers

Expected Reward Prediction, with Applications to Model Routing

New method predicts an LLM's expected reward score for a prompt before inference, routing each query to the best model and cutting costs.

Deep Dive

A team of researchers from Google and the Vector Institute has published a paper on a novel AI routing technique called Expected Reward Prediction (ERP). The core innovation is the ability to predict how well a specific large language model (LLM) will perform on a given prompt *before* it generates any text. This is achieved by repurposing standard reward models—which typically score and rank completed responses—to estimate the 'expected reward' a model would earn for that prompt under repeated sampling.
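To make the target quantity concrete: the 'expected reward' is the average score a reward model would assign over many sampled completions for the same prompt, and an ERP predictor is trained to estimate that average without generating anything. A minimal sketch of the Monte Carlo quantity being approximated, using hypothetical `model` and `reward_model` callables (these names are illustrative, not the paper's API):

```python
import statistics

def monte_carlo_expected_reward(model, reward_model, prompt, n_samples=8):
    # The quantity an ERP predictor learns to approximate:
    # the mean reward over repeated samples from one model.
    # model: callable prompt -> completion (hypothetical stand-in)
    # reward_model: callable (prompt, completion) -> float score
    rewards = [reward_model(prompt, model(prompt)) for _ in range(n_samples)]
    return statistics.mean(rewards)
```

ERP's value is that it predicts this number directly from the prompt, skipping the expensive sampling loop entirely at inference time.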

This predictive capability enables a practical application: intelligent model routing at inference time. In their experiments, the researchers created a pool of five open-source models, including Meta's Llama3.1-Instruct (8B and 70B parameters) and Google's Gemma2-IT (9B and 27B). For each incoming prompt, the ERP system predicts which model in the pool will deliver the highest reward score. It then routes the prompt exclusively to that model, maximizing output quality while controlling computational cost by avoiding unnecessary runs of larger, more expensive models.
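The routing step itself reduces to an argmax over predicted scores, with inference run only on the winner. A hedged sketch, assuming a `models` dict of name-to-generator callables and a learned `predict_expected_reward` function (both hypothetical stand-ins for the paper's components):

```python
def route_prompt(prompt, models, predict_expected_reward):
    """Send a prompt to the model with the highest predicted expected reward.

    models: dict mapping model name -> callable that generates text
    predict_expected_reward: callable (model_name, prompt) -> float
    """
    # Score every candidate with the ERP predictor; no text is
    # generated at this stage, so scoring the whole pool is cheap.
    best_name = max(models, key=lambda name: predict_expected_reward(name, prompt))
    # Only the selected model actually runs inference.
    return best_name, models[best_name](prompt)
```

Because candidates are scored independently, adding a new model to the pool only requires an ERP estimate for it, which is what makes the scheme "trivially extensible."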

The method demonstrated superior performance on the open-perfectblend dataset compared to simpler baselines, such as routing prompts based on a model's average performance within a broad category. The researchers argue that ERP explains the success of more complex routing protocols and offers a key advantage: it is "trivially extensible," meaning new models can be added to the routing pool without a complete system overhaul. This work, presented at the ICML 2025 Workshop on Human Feedback, provides a formalized and efficient framework for building heterogeneous, multi-model AI systems.

Key Points
  • Predicts an LLM's performance score (expected reward) for a prompt before text generation begins.
  • Outperformed category-based routing in tests using a pool of five models, including Llama3.1-Instruct 70B and Gemma2-IT 27B.
  • Enables cost-aware inference by routing queries to the optimally performing model, saving computational resources.

Why It Matters

Enables efficient, automated use of multiple AI models, reducing costs and improving response quality for scaled applications.