ResearchGym: Evaluating Language Model Agents on Real-World AI Research
GPT-5 agents completed only 26.5% of sub-tasks in a new benchmark built from real ICML, ICLR, and ACL papers.
Deep Dive
Researchers Aniketh Garikaparthi, Manasi Patwardhan, and Arman Cohan built ResearchGym, a benchmark for evaluating AI research agents. It draws on 5 real conference papers (ICML, ICLR, ACL), decomposed into 39 sub-tasks in which agents must propose hypotheses and run experiments. In tests, a GPT-5-powered agent improved on baselines in just 1 of 15 evaluations (6.7%) and completed only 26.5% of sub-tasks, highlighting major reliability gaps in autonomous research.
Why It Matters
Shows that autonomous AI research remains unreliable: agents still struggle with long-horizon planning and coordinating experiments.