This New Framework for Assessing Biological Risks from AI Scientists Has a Catch
As AI agents enter research labs, how do we measure their potential for harm?
A new preprint from researchers including Patricia Paskov tackles a pressing policy challenge: how to generate and interpret credible evidence about the biological capabilities and risks of AI agents — autonomous systems capable of performing multi-step scientific tasks. As these AI scientists enter real research workflows, decision-makers increasingly face evaluation results whose meaning depends on implicit or under-documented design choices. The paper synthesizes current evidence on AI-enabled biological risks and introduces 'biological agentic evaluations' as a promising but interpretation-sensitive tool for assessing these systems. The authors draw from their own evaluations to show how choices around defining, designing, running, scoring, and documenting evaluations materially shape what results do and do not imply about risk.
The analysis is intended to help policymakers interpret biological evaluation outputs with appropriate caution, guide public and private funders toward high-leverage investments in AI-biology evaluation research, and support biosecurity practitioners assessing emerging AI systems. A secondary audience includes researchers designing or conducting agentic evaluations within frontier AI labs, AI providers, scientific institutions, and third-party evaluation organizations. This work comes as concerns grow about dual-use risks from increasingly capable AI research tools — from automated protein design to autonomous wet-lab experimentation. The paper provides a structured way to think about what these evaluations actually measure and where they fall short.
- Synthesizes current evidence on AI-enabled biological risks from autonomous agents
- Introduces 'biological agentic evaluations' as a promising assessment tool whose results are sensitive to interpretation
- Provides practical, experience-grounded considerations for design choices affecting risk interpretation
Why It Matters
As AI scientists become more capable, this framework helps policymakers and biosecurity experts judge what biological evaluation results do and do not imply about risk, rather than taking those results at face value.