Evaluating Human-AI Safety: A Framework for Measuring Harmful Capability Uplift
A new position paper argues that AI safety evaluations should measure how much a model increases a user's ability to cause harm.
A team of researchers from MIT and Google has published a position paper titled 'Evaluating Human-AI Safety: A Framework for Measuring Harmful Capability Uplift' on arXiv. The authors (Michelle Vaccaro, Jaeyoon Song, Abdullah Almaatouq, and Michiel A. Bakker) argue that current safety evaluations of frontier AI, which emphasize static benchmarks, third-party annotations, and red-teaming, are fundamentally incomplete. They propose shifting to human-centered evaluations that measure what they term 'harmful capability uplift': the marginal increase in a user's ability to cause harm when using a frontier model such as GPT-4 or Claude 3, compared with what the same user could achieve with conventional tools such as search engines or existing software.
The paper grounds this concept in prior social science research and offers concrete methodological guidance for measuring it systematically. The authors contend that the metric matters because it directly assesses the real-world risk amplification posed by powerful AI systems, rather than their capabilities in isolation. They close with actionable steps for AI developers, safety researchers, funders, and regulators to make uplift evaluation standard practice, shifting safety assessment from a model-centric to a human-in-the-loop paradigm.
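The article doesn't reproduce the paper's protocol, but the core arithmetic of uplift is a treatment effect from a two-arm human study: one group attempts a harmful-proxy task with the model, a control group attempts it with conventional tools, and uplift is the difference in performance. Below is a minimal sketch of that estimate under assumed data; the scores, group sizes, and bootstrap interval are illustrative placeholders, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task-success scores (0 to 1) from a two-arm study:
# one group works with the frontier model, the control group with a
# conventional baseline such as a search engine. Placeholder values.
ai_scores = np.array([0.62, 0.55, 0.71, 0.48, 0.66, 0.59])
baseline_scores = np.array([0.41, 0.38, 0.52, 0.35, 0.44, 0.40])

# Point estimate: uplift as the marginal increase in mean performance.
uplift = ai_scores.mean() - baseline_scores.mean()

# Bootstrap a 95% confidence interval, resampling each arm with replacement.
boot = [
    rng.choice(ai_scores, size=ai_scores.size).mean()
    - rng.choice(baseline_scores, size=baseline_scores.size).mean()
    for _ in range(10_000)
]
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"estimated uplift: {uplift:.3f} (95% CI: {lo:.3f} to {hi:.3f})")
```

Framing uplift as an estimated effect size with uncertainty, rather than a pass/fail benchmark score, is what distinguishes this kind of human study from the static evaluations the paper critiques.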
- Proposes 'harmful capability uplift' as a core AI safety metric: the marginal increase in a user's ability to cause harm when assisted by a frontier model.
- Critiques current methods like static benchmarks and red-teaming as insufficient for assessing real-world risk.
- Provides concrete methodological guidance and calls for adoption by developers, researchers, funders, and regulators.
Why It Matters
This could fundamentally change how AI safety is tested, focusing on real-world risk amplification instead of abstract benchmarks.