LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
A new benchmark of nearly 1,900 tasks shows top AI models struggle with practical scientific work, with accuracy dropping 26% to 46% relative to its predecessor.
A consortium of researchers from institutions including the University of Rochester and Arc Institute has launched LABBench2, a major evolution of the LAB-Bench benchmark for evaluating AI systems in biology. The new suite comprises nearly 1,900 tasks that shift the focus from testing rote scientific knowledge to measuring an AI's ability to perform meaningful, real-world research work, such as experimental design and data analysis. This represents a critical step toward assessing AI agents and autonomous labs that can actively participate in the scientific process.
When tested on current frontier models such as GPT-4 and Claude 3, LABBench2 revealed a substantial 'difficulty jump' compared to the original LAB-Bench: accuracy dropped by 26% to 46% depending on the model and subtask, underscoring that while AI has improved on theoretical benchmarks, significant gaps remain in practical, applied scientific reasoning. The team has made the full dataset and an evaluation harness publicly available to spur development, positioning LABBench2 as the new de facto standard for measuring progress toward AI-driven scientific discovery.
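The article does not describe the released harness's interface, so purely as an illustration, the sketch below shows how a simple accuracy score on a set of multiple-choice research tasks, and the drop between two benchmark versions, might be computed. All names here (Task, accuracy, drop_percent, predict) are hypothetical and not taken from the released harness, and the sketch assumes the reported drop is measured in percentage points.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    question: str
    choices: list[str]
    answer: str          # key for the correct choice, e.g. "B"

def accuracy(tasks: list[Task], predict: Callable[[Task], str]) -> float:
    """Fraction of tasks where the model's chosen option matches the answer key."""
    correct = sum(1 for t in tasks if predict(t) == t.answer)
    return correct / len(tasks)

def drop_percent(old_acc: float, new_acc: float) -> float:
    """Percentage-point decline from the older benchmark to the newer one."""
    return (old_acc - new_acc) * 100.0

# Illustrative numbers only -- not actual LAB-Bench or LABBench2 results.
print(f"Accuracy drop: {drop_percent(0.75, 0.45):.0f} percentage points")
```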
- Contains nearly 1,900 tasks focused on real-world scientific capabilities like experimental design and analysis.
- Reveals a 26% to 46% accuracy drop for top models versus its predecessor, highlighting a major practical skills gap.
- Publicly released with a dataset and eval harness to standardize measurement of AI progress in science.
Why It Matters
It sets a harder, more realistic standard for developing AI that can truly accelerate research in labs and biotech.