Research & Papers

BaSE: Multi-Armed Bandit Allocates LLM Calls to Boost Evolutionary Search by 12.3%

A new bandit algorithm allocates compute across evolutionary trajectories, improving results without changing prompts or models.

Deep Dive

LLM-guided evolutionary search (Evolve systems) has achieved state-of-the-art results on mathematical and combinatorial tasks, but most papers report only the best run, ignoring run-to-run variance. A new paper from Sixue Xing, Haoyu He, Kerui Wu, and colleagues asks a critical question: given a fixed budget of LLM calls, how should those calls be allocated across parallel search trajectories to maximize reliability and mean performance?

By sweeping depth-breadth grids over five models (including GPT-4, Claude, and open-weight variants) and three tasks, the authors uncovered two empirical regularities: a fitness-compute envelope where capability differences largely collapse onto a curve driven by effective FLOPs, and a bilinear depth-breadth fit with task-specific interactions. Both are gated by the model's capability relative to the task difficulty. These insights motivated a new algorithm.
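
For readers unfamiliar with the term, a bilinear fit is linear in each variable separately and adds a product term for their interaction. The form below is a generic illustration only: the symbols F, d, b, and the coefficients are assumptions introduced here, and the paper's exact parameterization (for instance, whether depth and breadth are log-scaled) is not stated in this summary.

```latex
% Generic bilinear form (illustrative; not taken from the paper)
% F: fitness, d: search depth, b: search breadth (number of parallel trajectories)
F(d, b) \approx \alpha + \beta_d\, d + \beta_b\, b + \gamma\, d\, b
% The interaction coefficient \gamma makes the marginal value of depth depend on breadth:
\frac{\partial F}{\partial d} = \beta_d + \gamma\, b
```

Under this reading, a "task-specific interaction" would correspond to a coefficient like γ differing from task to task, so the best depth-breadth split is not the same everywhere.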

Enter BaSE (Bandit-based Self-Evolving), a multi-armed bandit that dynamically allocates compute across parallel trajectories as the search runs. Without altering the model, prompt, or evaluator, BaSE improved mean fitness by 12.3% over the strongest island-protocol baseline across eight (model, task) cells. The largest gains came in high-variance settings, where allocation alone made a single run much more likely to reach the reported numbers.
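
To make the idea of bandit-based allocation concrete, here is a minimal sketch that spends a fixed call budget across parallel trajectories using the textbook UCB1 rule. This is not the BaSE algorithm: the summary does not describe BaSE's actual selection rule or reward signal, so UCB1 is swapped in as a stand-in. The names `make_trajectory` and `allocate_calls_ucb1`, the simulated fitness values, and the assumption that fitness lies in [0, 1] are all hypothetical.

```python
import math
import random


def make_trajectory(mean, spread, rng):
    """Hypothetical stand-in for one search trajectory.

    In a real system, each call would cost one LLM call plus an evaluation.
    Here it just returns a noisy fitness value clipped to [0, 1].
    """
    def step():
        return min(1.0, max(0.0, rng.gauss(mean, spread)))
    return step


def allocate_calls_ucb1(step_fns, budget, c=1.0):
    """Spend `budget` calls across trajectories with UCB1 (illustrative only).

    Returns the number of calls given to each trajectory and the best
    fitness each trajectory observed.
    """
    n = len(step_fns)
    pulls = [0] * n
    reward_sum = [0.0] * n
    best = [float("-inf")] * n

    for t in range(1, budget + 1):
        untried = [i for i in range(n) if pulls[i] == 0]
        if untried:
            # Give every trajectory one call before scoring.
            arm = untried[0]
        else:
            # UCB1 score: mean fitness plus an exploration bonus that
            # shrinks as a trajectory receives more calls.
            arm = max(
                range(n),
                key=lambda i: reward_sum[i] / pulls[i]
                + c * math.sqrt(2.0 * math.log(t) / pulls[i]),
            )
        fitness = step_fns[arm]()
        pulls[arm] += 1
        reward_sum[arm] += fitness
        best[arm] = max(best[arm], fitness)

    return pulls, best


# Usage: three simulated trajectories of differing quality, 200 calls total.
rng = random.Random(0)
trajectories = [make_trajectory(m, 0.05, rng) for m in (0.2, 0.5, 0.8)]
pulls, best = allocate_calls_ucb1(trajectories, budget=200)
print("calls per trajectory:", pulls)
```

The design point is that the budget is fixed and only its distribution changes: trajectories that look promising receive more calls, while the exploration bonus keeps weaker ones from being abandoned too early. A uniform split would correspond to ignoring the scores and cycling through the trajectories in turn.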

This work has immediate practical implications: because BaSE leaves the model, prompt, and evaluator untouched, teams running LLM-guided search can improve results through allocation alone, within the same budget of LLM calls. BaSE makes evolutionary optimization more predictable and efficient, reducing wasted API calls while boosting outcomes. As enterprises increasingly use LLM-based agents for optimization, this allocation-first approach could become standard practice.

Key Points
  • Identified a fitness-compute envelope in which capability differences largely collapse onto a curve driven by effective FLOPs, gated by model capability relative to task difficulty.
  • BaSE improves mean fitness by 12.3% across 8 (model, task) cells without changing model, prompt, or evaluator.
  • Largest gains in high-variance settings, improving single-run reliability and reducing wasted LLM calls.

Why It Matters

Better compute allocation alone improves mean fitness in LLM-guided search by 12.3% over the strongest island-protocol baseline, making evolutionary methods more reliable and cost-efficient.