Agent Frameworks

Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes

1,197 trials without human intervention yield non-trivial code edits and recipe improvements.

Deep Dive

A team of researchers (Jingjie Ning, Xiaochuan Li, Ji Zeng, Hao Kang, Chenyan Xiong) introduced an automated research framework in which specialist AI agents run a closed empirical loop. In each trial, an agent submits a hypothesis and an executable code edit, the evaluator owns the outcome, and that feedback shapes the next proposal. The output is an auditable trajectory of proposals, code diffs, experiments, scores, and failure labels, not just a paper or model checkpoint. The agents partition recipe surfaces and share measured lineage across trials, turning evaluator outcomes (crashes, budget overruns, size failures, accuracy misses) into program-level recipe edits over time.
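As a rough sketch of that loop's shape only: the fragment below assumes hypothetical names (Trial, agent.propose, evaluator.run) standing in for the framework's real interfaces, which the write-up does not show.

```python
# Illustrative sketch of the closed empirical loop described above.
# Trial, agent.propose, and evaluator.run are assumed names, not the paper's API.
from dataclasses import dataclass

@dataclass
class Trial:
    hypothesis: str                   # what the agent expects the edit to improve
    code_diff: str                    # executable recipe edit applied before the run
    score: float | None = None        # evaluator-owned outcome metric
    failure_label: str | None = None  # e.g. "crash", "budget_overrun", "size_failure"

def run_closed_loop(agent, evaluator, n_trials: int) -> list[Trial]:
    """Run the closed loop; the trajectory itself is the auditable output."""
    trajectory: list[Trial] = []
    for _ in range(n_trials):
        # The agent sees the measured lineage so far, failures included.
        trial = agent.propose(trajectory)
        trial.score, trial.failure_label = evaluator.run(trial.code_diff)
        trajectory.append(trial)
    return trajectory
```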

Across three headline runs (totaling 1,197 trials plus 600 Parameter Golf control trials), humans never chose proposals, edited recipes, overrode scores, or repaired failed trials. The results are striking: Parameter Golf validation bpb fell by 0.81%, NanoChat-D12 CORE rose by 38.7%, and CIFAR-10 Airbench96 wallclock time dropped by 4.59%. The trace includes a strict architecture-domain audit of 157 submissions and autonomous program rewrites such as a NanoChat attention-kernel path change. This demonstrates that specialist agents can autonomously explore and apply known techniques inside each environment, improving public starting recipes without human guidance.

Key Points
  • 1,197 autonomous headline-run trials plus 600 control trials completed with zero human intervention after setup.
  • Parameter Golf validation bpb fell by 0.81%; NanoChat-D12 CORE improved by 38.7%; CIFAR-10 Airbench96 wallclock time dropped by 4.59%.
  • Agents turned evaluator failures (crashes, budget overruns) into successful program-level recipe edits, as sketched below.
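
One plausible way that failure-to-edit feedback could be wired, with the label set taken from the run's failure taxonomy and everything else (the hint strings, next_proposal_hint) assumed for illustration:

```python
# Assumed mapping from evaluator failure labels to guidance for the next
# proposal; labels match the run's taxonomy, the hint strings are invented.
FAILURE_HINTS: dict[str, str] = {
    "crash": "revert to the last passing diff; propose a smaller, isolated edit",
    "budget_overrun": "reduce step count or batch size to fit the compute budget",
    "size_failure": "shrink parameter count to satisfy the model-size constraint",
    "accuracy_miss": "keep the edit but retune its hyperparameters before retrying",
}

def next_proposal_hint(failure_label: str | None) -> str:
    """Translate an evaluator outcome into program-level guidance."""
    if failure_label is None:
        return "success: extend this lineage with a follow-up hypothesis"
    return FAILURE_HINTS.get(failure_label, "unknown failure: rerun with diagnostics")
```

In the actual runs this guidance comes from the evaluator's feedback channel rather than a fixed table; the table just makes the failure-as-data pattern concrete.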

Why It Matters

Closed-loop runs like these push ML research automation forward: they cut the human time spent on empirical iteration while still surfacing non-obvious training improvements.