New FML-Bench finds MLE-Bench gains come mostly from better models, not algorithms
MLE-Bench scores jumped from about 30% to 80%, but little of that gain is algorithmic.
A new preprint introduces FML-Bench, a benchmark designed to separate algorithmic improvements from confounding factors such as better base models and larger search budgets. The paper reports that MLE-Bench scores rose from approximately 30% to 80% over the past two years, but attributes most of this gain to advances in the underlying models (e.g., GPT-4 to GPT-5) and increased search compute rather than to novel agent architectures or memory strategies. When the researchers controlled for step budget and used the same base model across generations, the two-year-old AIDE algorithm matched or exceeded modern agentic and evolutionary search systems on a held-out set of tasks.
FML-Bench unifies the code-editing agent interface, the step definition, and the validation/test splits to create a more controlled evaluation environment. It benchmarks the algorithmic efficiency of agents (their search strategies, memory utilization, and optimization techniques) rather than raw compute scaling or model capability. The findings suggest that the field may have overestimated genuine algorithmic progress in automated ML research, and that many reported improvements are driven by resource scaling. The practical takeaway for researchers is to account for confounding variables such as base model and search budget before claiming an algorithmic breakthrough.
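To make the idea of a controlled comparison concrete, here is a minimal Python sketch of two toy search strategies evaluated under the same step budget and the same proposal function. It is an illustration only and not FML-Bench's actual interface: `BudgetedEnv`, `propose`, `greedy_search`, `random_restart_search`, and `compare` are hypothetical names, the "base model" is replaced by a random perturbation, and a "step" is assumed to mean one candidate evaluation.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical sketch: none of these names come from FML-Bench itself.


@dataclass
class BudgetedEnv:
    """Charges one step per candidate evaluation, whatever the algorithm."""
    score_fn: Callable[[float], float]
    step_budget: int
    steps_used: int = 0

    def evaluate(self, candidate: float) -> float:
        if self.steps_used >= self.step_budget:
            raise RuntimeError("step budget exhausted")
        self.steps_used += 1
        return self.score_fn(candidate)


def propose(rng: random.Random, parent: float) -> float:
    """Stand-in for the shared base model: perturbs a parent solution."""
    return parent + rng.uniform(-1.0, 1.0)


def greedy_search(env: BudgetedEnv, rng: random.Random) -> float:
    """Keep a single best solution and refine it until the budget runs out."""
    best = 0.0
    best_score = env.evaluate(best)
    while env.steps_used < env.step_budget:
        candidate = propose(rng, best)
        score = env.evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best_score


def random_restart_search(env: BudgetedEnv, rng: random.Random) -> float:
    """Propose from a fresh random starting point on every step."""
    best_score = float("-inf")
    while env.steps_used < env.step_budget:
        candidate = propose(rng, rng.uniform(-5.0, 5.0))
        best_score = max(best_score, env.evaluate(candidate))
    return best_score


def compare(
    algorithms: Dict[str, Callable[[BudgetedEnv, random.Random], float]],
    score_fn: Callable[[float], float],
    step_budget: int,
    seed: int,
) -> Dict[str, Tuple[float, int]]:
    """Run each algorithm with an identical budget, scorer, and seed."""
    results = {}
    for name, algorithm in algorithms.items():
        env = BudgetedEnv(score_fn, step_budget)
        best_score = algorithm(env, random.Random(seed))
        results[name] = (best_score, env.steps_used)
    return results


if __name__ == "__main__":
    outcome = compare(
        {"greedy": greedy_search, "random_restart": random_restart_search},
        score_fn=lambda x: -(x - 2.0) ** 2,  # toy objective, maximum at x = 2
        step_budget=20,
        seed=0,
    )
    for name, (score, steps) in outcome.items():
        print(f"{name}: best score {score:.3f} in {steps} steps")
```

Because the environment, not the algorithm, counts steps, neither strategy can gain an advantage by spending more evaluations. Any remaining score difference then comes from the search strategy alone, which is the kind of isolation the benchmark is described as aiming for.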
- MLE-Bench scores rose from ~30% to ~80% over two years, but most gains come from better base models and more search.
- After controlling for step budget and model, the two-year-old AIDE algorithm performs on par with modern agentic/evolutionary systems.
- FML-Bench unifies the code-editing agent interface, step definition, and validation/test splits to isolate algorithmic efficiency (search, memory, and optimization).
Why It Matters
The result challenges the narrative of rapid algorithmic progress in automated ML research and is a reminder to control for compute and base-model improvements before crediting an algorithm.