Curation-Bench fixes model, training recipe, and evaluation, giving agents command-line access to inspect, implement, and revise data policies?

Curation-Bench fixes model, training recipe, and evaluation, giving agents command-line access to inspect, implement, and revise data policies

Out-of-the-box agents match strong data-selection baselines within 10 iterations, but exhibit an 'execution-research gap' by mostly tuning local policies?

Out-of-the-box agents match strong data-selection baselines within 10 iterations, but exhibit an 'execution-research gap' by mostly tuning local policies

A scaffolded approach forces method citation and adaptation, enabling agents to autonomously design a policy that beats baselines using 1/10th the data?

A scaffolded approach forces method citation and adaptation, enabling agents to autonomously design a policy that beats baselines using 1/10th the data

Research & Papers

AI agents automate data curation, beating baselines at 1/10th data budget

arXiv cs.AI June 04, 2026

⚡New benchmark reveals agents can autonomously design superior data policies…

Deep Dive

A team led by Feiyang Kang from UC Berkeley and other institutions released Curation-Bench, a benchmark that tests whether generalist coding agents can handle the tedious, iterative process of curating training data for AI models. The benchmark fixes the model, training recipe, and evaluation suite, then gives agents command-line access to inspect data, implement policies, submit them to a fixed pipeline, and revise based on feedback.

Out-of-the-box agents can match established data-selection baselines within 10 iterations. But trajectory analysis reveals a persistent 'execution-research gap': agents tend to tune local policy variants instead of exploring new families of policies, even when given strategy guides and paper references. However, when agents are scaffolded with requirements to cite, instantiate, and adapt a prior method in each iteration, they shift toward method-guided exploration. In this mode, the agent autonomously composed a data-selection policy that outperforms strong published baselines while using only one-tenth the data budget. The authors conclude that current agents can run the curation loop but reliable data research requires structured method adaptation, not open-ended prompting.

Key Points

Curation-Bench fixes model, training recipe, and evaluation, giving agents command-line access to inspect, implement, and revise data policies
Out-of-the-box agents match strong data-selection baselines within 10 iterations, but exhibit an 'execution-research gap' by mostly tuning local policies
A scaffolded approach forces method citation and adaptation, enabling agents to autonomously design a policy that beats baselines using 1/10th the data

Why It Matters

Automating data curation—the most labor-intensive part of AI development—could dramatically reduce costs and accelerate model improvement cycles.

Read Original Article

AI agents automate data curation, beating baselines at 1/10th data budget

Why It Matters

Related Articles

🚀 Stay Ahead in AI