Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch
A hybrid system combining a 0.8B-parameter LLM with CMA-ES optimization outperforms both purely classical and purely LLM-based methods at hyperparameter tuning.
A research team from the University of Freiburg and collaborators published a study asking whether LLMs can beat classical hyperparameter optimization (HPO) algorithms. Using their new autoresearch framework—which lets an LLM agent edit training code directly in an unconstrained search space—they compared methods for tuning a small language model. Within a fixed search space, classical algorithms like CMA-ES and TPE consistently outperformed LLM-based agents. However, an LLM agent using direct code editing narrowed the performance gap significantly, even when using only a self-hosted, open-weight 27B model.
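The fixed-search-space setting the classical methods compete in can be illustrated with a generic ask/tell optimization loop. The bounds, toy objective, and simplified (1+1)-ES update below are illustrative stand-ins for "train the model, measure validation loss", not the paper's actual setup or a full CMA-ES implementation:

```python
import random

# Hypothetical fixed search space for tuning a small LM (illustrative bounds).
SPACE = {"lr": (1e-5, 1e-2), "weight_decay": (0.0, 0.3)}

def objective(params):
    # Stand-in for "train the model, return validation loss":
    # a smooth toy function with its optimum inside the space.
    return (params["lr"] - 3e-3) ** 2 + (params["weight_decay"] - 0.1) ** 2

def clip(name, value):
    lo, hi = SPACE[name]
    return min(max(value, lo), hi)

def es_loop(iters=200, sigma=0.05, seed=0):
    """Simplified (1+1)-ES ask/tell loop in the spirit of CMA-ES:
    keep a mean point, sample a Gaussian perturbation scaled to each
    parameter's range, and accept the candidate only on improvement."""
    rng = random.Random(seed)
    mean = {k: (lo + hi) / 2 for k, (lo, hi) in SPACE.items()}
    best = objective(mean)
    for _ in range(iters):
        cand = {k: clip(k, v + rng.gauss(0.0, sigma) * (SPACE[k][1] - SPACE[k][0]))
                for k, v in mean.items()}
        loss = objective(cand)
        if loss < best:  # accept: move the search distribution
            mean, best = cand, loss
    return mean, best
```

A real CMA-ES would additionally adapt the step size and a full covariance matrix from the history of accepted samples; the point here is only the ask/tell contract that confines the search to the predefined space.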
The study revealed key trade-offs: methods that avoided out-of-memory failures outperformed those with higher search diversity, highlighting that reliability matters more than exploration breadth. While smaller LLMs struggled to track optimization state, classical methods lacked domain knowledge. To bridge this gap, the team introduced Centaur, a hybrid system that shares CMA-ES's internal state (including mean vector, step-size, and covariance matrix) with an LLM. Centaur achieved the best experimental results, with its 0.8B-parameter variant outperforming the 27B variant. This suggests that a cheap, small LLM suffices when paired with a strong classical optimizer; among the open-weight models tested, scaling model size gave fixed-space methods no advantage. The code is publicly available.
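Centaur's core idea of exposing the optimizer's internal state to the LLM might look roughly like the sketch below. The state fields mirror those named above (mean vector, step-size, covariance matrix), but the prompt format and the `propose_with_llm` helper are hypothetical illustrations, not the paper's implementation:

```python
import json

def serialize_es_state(mean, sigma, cov):
    """Pack CMA-ES internals (mean vector, step-size, covariance matrix)
    into a prompt so an LLM can reason about the current search state."""
    state = {"mean": mean, "step_size": sigma, "covariance": cov}
    return (
        "You are assisting a CMA-ES hyperparameter optimizer.\n"
        "Current optimizer state:\n"
        + json.dumps(state, indent=2)
        + "\nPropose the next candidate as a JSON list of floats."
    )

def propose_with_llm(llm, mean, sigma, cov):
    """Hypothetical hybrid step: ask the LLM for a candidate and fall back
    to the optimizer's current mean if the reply is not parseable."""
    reply = llm(serialize_es_state(mean, sigma, cov))
    try:
        cand = json.loads(reply)
        if isinstance(cand, list) and len(cand) == len(mean):
            return [float(x) for x in cand]
    except (ValueError, TypeError):
        pass
    return list(mean)  # reliability first: never crash the outer loop
```

The fallback path reflects the study's reliability finding: a hybrid that degrades gracefully to the classical proposal beats one that fails on a malformed LLM reply.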
- Classical HPO methods (CMA-ES, TPE) beat pure LLM agents in fixed search spaces, but an LLM editing code directly narrows the gap.
- The hybrid 'Centaur' system, sharing CMA-ES state with an LLM, achieved the best results, with its 0.8B model outperforming a 27B model.
- Reliability (avoiding failures) mattered more than search diversity, and a cheap LLM suffices in a hybrid setup, challenging the need for massive models.
Why It Matters
This research challenges the need for massive LLMs in optimization tasks, showing efficient hybrids can reduce compute costs for AI model tuning.