ResearchEnvBench: Benchmarking Agents on Environment Synthesis for Research Code Execution
New benchmark reveals a 0% success rate for current AI agents on a critical scientific task.
A research team led by Yubang Wang has published a new benchmark, ResearchEnvBench, designed to test a critical but overlooked capability in AI agents: environment synthesis for research code execution. While current benchmarks measure an agent's skill at writing or repairing code, they typically assume a pre-configured, ready-to-run software environment. ResearchEnvBench removes that crutch, tasking agents with the complex, real-world job of constructing a functional environment from scratch. Given only a research repository (such as one hosted on GitHub), its documentation, and a target execution setting, the agent must resolve all software dependencies, align hardware and framework versions, and configure any necessary distributed execution to make the code run.
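To make the task concrete, one way to picture a single benchmark instance is as a small specification the agent receives before it starts work. The sketch below is illustrative only; the field names, values, and structure are assumptions for explanation, not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass
class EnvSynthesisTask:
    """Hypothetical sketch of one environment-synthesis task instance."""
    repo_url: str          # research repository the agent must make runnable
    docs_path: str         # README / installation notes available to the agent
    target_command: str    # entry-point command that must execute successfully
    hardware: str          # target execution setting, e.g. GPU model and CUDA version
    distributed: bool = False  # whether a multi-GPU / multi-node launch must be configured


# Illustrative instance (all values are made up, not drawn from the benchmark):
task = EnvSynthesisTask(
    repo_url="https://github.com/example-lab/example-paper-code",
    docs_path="README.md",
    target_command="python train.py --config configs/base.yaml",
    hardware="1x NVIDIA A100, CUDA 12.1",
    distributed=False,
)
# Under this framing, the agent succeeds only if, starting from the specification,
# it builds an environment in which `target_command` runs to completion.
```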
Initial evaluations using ResearchEnvBench on a diverse set of research repositories have delivered a stark verdict on the current state of autonomous AI agents. State-of-the-art agents achieved a 0% success rate, leaving a substantial gap between current capabilities and the benchmark's requirements. The failures were dominated by two core issues: agents could not fully resolve the complex, nested dependency chains common in research software, and they struggled with the brittle coupling between specific library versions and hardware configurations. This failure highlights that today's agents, while proficient at generating code snippets, are not yet capable of the full-stack problem-solving required for hands-off scientific assistance.
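The second failure mode is easy to reproduce by hand. The following minimal diagnostic sketch uses PyTorch and CUDA purely as a familiar example of version-to-hardware coupling; the benchmark's report does not specify which libraries were involved, so this is an assumed illustration, not a description of its test cases.

```python
# Minimal sketch of one common version/hardware coupling check.
# Assumes PyTorch is installed; uses only standard torch attributes.
import torch


def report_cuda_coupling() -> None:
    """Print the CUDA version PyTorch was built against and whether the
    local GPU/driver pairing can actually serve that build."""
    built_against = torch.version.cuda   # e.g. "12.1", or None for a CPU-only build
    usable = torch.cuda.is_available()   # False if the driver/toolkit pairing is broken
    print(f"PyTorch build expects CUDA: {built_against}")
    print(f"GPU usable with this build: {usable}")
    if built_against is not None and not usable:
        # The coupling described above: the installed wheel is pinned to a CUDA
        # version the local driver cannot serve, so research code that calls
        # .cuda() crashes even though `pip install` reported success.
        print("Mismatch: a torch wheel built for the local driver is needed.")


if __name__ == "__main__":
    report_cuda_coupling()
```

An agent that only reads a repository's requirements file can miss this entirely, because the dependency list resolves cleanly while the resulting environment still cannot execute the code on the target hardware.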
The creation of ResearchEnvBench provides the AI research community with its first realistic testbed for this essential capability. By quantifying this specific shortcoming, the benchmark sets a clear target for future development. Advancing agent performance on ResearchEnvBench is a direct step toward building AI systems that can autonomously replicate studies, validate results, and contribute to reproducible scientific research, moving beyond simple code generation to true computational research support.
- Agents scored a 0% success rate on the new ResearchEnvBench benchmark for environment synthesis.
- Failures were dominated by incomplete dependency resolution and brittle software version coupling.
- The benchmark provides the first testbed for advancing AI toward hands-off, reproducible scientific research.
Why It Matters
Shows AI cannot yet autonomously run scientific code, a major hurdle for AI research assistants.