ResearchGym: Evaluating Language Model Agents on Real-World AI Research
GPT-5 agents completed only 26.5% of sub-tasks in a new benchmark built from real ICML, ICLR, and ACL papers.
Deep Dive
Researchers Aniketh Garikaparthi, Manasi Patwardhan, and Arman Cohan built ResearchGym, a benchmark for evaluating AI research agents. It draws on 5 real conference papers (ICML, ICLR, ACL), decomposed into 39 sub-tasks in which agents must propose hypotheses and run experiments. In tests, a GPT-5-powered agent improved on baselines in just 1 of 15 evaluations (6.7%) and completed only 26.5% of sub-tasks, highlighting major reliability gaps in autonomous research.
Why It Matters
Shows that autonomous AI research remains unreliable: agents still struggle with long-horizon planning and coordinating experiments.