AI agents automate data curation, beating baselines at 1/10th data budget
New benchmark reveals agents can autonomously design superior data policies…
A team led by Feiyang Kang from UC Berkeley and other institutions released Curation-Bench, a benchmark that tests whether generalist coding agents can handle the tedious, iterative process of curating training data for AI models. The benchmark fixes the model, training recipe, and evaluation suite, then gives agents command-line access to inspect data, implement policies, submit them to a fixed pipeline, and revise based on feedback.
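The inspect, implement, submit, and revise cycle described above can be pictured with a toy loop. This is only an illustrative sketch, not Curation-Bench's actual interface: the data pool, the `fixed_pipeline` scoring function, and the threshold policy are all hypothetical stand-ins, and the "revise" step is a deliberately naive local tweak.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, float]          # (document id, hypothetical quality score)
Policy = Callable[[List[Example]], List[Example]]

# Hypothetical toy data pool; not Curation-Bench data.
POOL: List[Example] = [(f"doc{i}", i / 10) for i in range(10)]


def fixed_pipeline(selected: List[Example]) -> float:
    """Stand-in for the fixed train-and-evaluate step.

    Here the "score" is simply the mean quality of the selected examples.
    """
    if not selected:
        return 0.0
    return sum(quality for _, quality in selected) / len(selected)


def make_threshold_policy(threshold: float) -> Policy:
    """Build a policy that keeps examples at or above a quality threshold."""
    return lambda pool: [ex for ex in pool if ex[1] >= threshold]


def curation_loop(iterations: int = 10):
    """Submit a policy, read the score, revise, and repeat."""
    best_score, best_threshold = float("-inf"), None
    threshold = 0.0
    history = []
    for _ in range(iterations):
        policy = make_threshold_policy(threshold)       # implement
        score = fixed_pipeline(policy(POOL))            # submit to fixed pipeline
        history.append((threshold, score))              # feedback
        if score > best_score:
            best_score, best_threshold = score, threshold
        threshold = round(threshold + 0.1, 1)           # revise (local tweak only)
    return best_threshold, best_score, history
```

Calling `curation_loop(10)` runs ten submissions against the same fixed scorer, so only the data policy changes between iterations, which mirrors the benchmark's design of holding the model, recipe, and evaluation constant.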
Out-of-the-box agents can match established data-selection baselines within 10 iterations. But trajectory analysis reveals a persistent 'execution-research gap': agents tend to tune local variants of a policy instead of exploring new families of policies, even when given strategy guides and paper references. When agents are instead scaffolded with a requirement to cite, instantiate, and adapt a prior method in each iteration, they shift toward method-guided exploration. In this mode, an agent autonomously composed a data-selection policy that outperforms strong published baselines while using only one-tenth of the data budget. The authors conclude that current agents can run the curation loop, but that reliable data research requires structured method adaptation, not open-ended prompting.
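One way to picture the scaffold's cite, instantiate, and adapt requirement is as a gate on each iteration's proposal. The sketch below is an assumption about how such a check could look, not the authors' implementation: the `Proposal` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    """Hypothetical record an agent fills in before each submission."""
    cited_method: str     # prior method the agent cites
    instantiation: str    # how that method is implemented as a data policy
    adaptation: str       # what is changed to fit the current setting


REQUIRED_FIELDS = ("cited_method", "instantiation", "adaptation")


def missing_fields(proposal: Proposal) -> List[str]:
    """Return the names of required fields that are empty or whitespace."""
    return [name for name in REQUIRED_FIELDS
            if not getattr(proposal, name).strip()]


def accept(proposal: Proposal) -> bool:
    """Accept a proposal only if it cites, instantiates, and adapts a method."""
    return not missing_fields(proposal)
```

Under this sketch, a proposal that only tweaks a parameter without naming a prior method would be rejected before reaching the pipeline, which is the kind of pressure toward method-guided exploration the article describes.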
- Curation-Bench fixes model, training recipe, and evaluation, giving agents command-line access to inspect, implement, and revise data policies
- Out-of-the-box agents match established data-selection baselines within 10 iterations, but exhibit an 'execution-research gap': they mostly tune local policy variants rather than explore new policy families
- A scaffold that requires agents to cite, instantiate, and adapt a prior method in each iteration enabled an agent to autonomously compose a policy that beats strong published baselines using one-tenth of the data budget
Why It Matters
Automating data curation, often one of the most labor-intensive parts of AI development, could substantially reduce costs and shorten model improvement cycles.