Measuring What AI Systems Might Do: Towards A Measurement Science in AI
New paper argues current AI benchmarks like MMLU are flawed and proposes a rigorous, scientific framework for evaluation.
A team of leading AI researchers from institutions including the University of Cambridge and the Leverhulme Centre for the Future of Intelligence has published a paper calling for a fundamental overhaul of how we evaluate AI systems. The paper, 'Measuring What AI Systems Might Do: Towards A Measurement Science in AI,' argues that current practices, from simple benchmark averages like those reported for GPT-4o on MMLU to sophisticated latent-variable models, fail to measure what they claim to measure. The authors contend that terms like 'capabilities,' 'skills,' and 'values' are used interchangeably and conflated with observable performance, creating a misleading picture of what systems like Llama 3 or Gemini are truly disposed to do.
The core proposal is to treat AI capabilities as 'dispositions': stable properties defined by counterfactual relationships between context and behavior. Scientifically measuring a disposition requires three steps: hypothesizing which contextual properties are causally relevant, measuring those properties independently, and empirically mapping how varying them changes the probability of the behavior. This framework, drawn from the philosophy of science and measurement theory, directly challenges the status quo: it implies that evaluations must move beyond single-number scores and instead systematically test how models respond to controlled variations in prompts, environments, and constraints. This shift is critical for accurately assessing risks, aligning AI with human values, and deploying autonomous agents reliably.
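To make the three steps concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' protocol: `query_model` is a hypothetical stand-in for any model API, and the two contextual properties (prompt format and the presence of distractor text) are invented for the example. The point is the shape of the procedure: vary each hypothesized property independently while holding the task fixed, then estimate how the probability of the target behavior shifts across contexts.

```python
# Minimal sketch of three-step disposition measurement; not the
# authors' implementation. `query_model` is a hypothetical stand-in
# for a real model API call.
import itertools
import random

def query_model(prompt: str) -> str:
    # Placeholder: returns a random verdict so the sketch runs
    # standalone. Swap in a real model client here.
    return random.choice(["correct", "incorrect"])

# Step 1: hypothesize which contextual properties are causally
# relevant to the behavior (here: prompt format, distractor text).
contexts = {
    "format": ["multiple_choice", "open_ended"],
    "distractor": [False, True],
}

# Step 2: set each property independently, holding the underlying
# task fixed so that only the context varies between cells.
TASK = "What is 17 * 24?"

def build_prompt(fmt: str, distractor: bool) -> str:
    prompt = TASK
    if distractor:
        prompt = "Ignore the weather report below. " + prompt
    if fmt == "multiple_choice":
        prompt += "\n(a) 408  (b) 418  (c) 398"
    return prompt

# Step 3: empirically map how variation in context shifts the
# probability of the target behavior (answering correctly).
N = 50  # samples per context cell
for fmt, distractor in itertools.product(*contexts.values()):
    hits = sum(
        query_model(build_prompt(fmt, distractor)) == "correct"
        for _ in range(N)
    )
    print(f"format={fmt:<15} distractor={distractor!s:<5} "
          f"P(correct) ~= {hits / N:.2f}")
```

The output is a table of estimated behavioral probabilities per context cell, which is the kind of dispositional evidence the paper argues a capability claim requires, rather than one averaged score.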
- Proposes treating AI capabilities as 'dispositions' requiring counterfactual testing, not just observed performance.
- Critiques dominant evaluation methods like benchmark averages and Item Response Theory models as scientifically inadequate.
- Outlines a three-step framework for rigorous measurement: hypothesize causally relevant contextual properties, measure them independently, and map behavioral probabilities (the contrast with benchmark averages is formalized below).
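To sharpen the contrast in the points above, the following notation is ours, not the paper's: a benchmark average collapses a model m's performance over a task set T into one scalar, whereas a dispositional measurement estimates a function from a hypothesized context space to behavioral probabilities.

```latex
% Benchmark average: a single scalar over a task set T.
\bar{s}_m = \frac{1}{|T|} \sum_{t \in T} \mathrm{score}(m, t)

% Dispositional measurement: a map from contexts to behavioral
% probabilities, estimated empirically at sampled contexts c.
f_m : \mathcal{C} \to [0, 1], \qquad
f_m(c) = \Pr\bigl(\text{behavior} \mid \text{context} = c\bigr)
```

Estimating f_m at controlled variations of c is what the three-step framework above operationalizes.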
Why It Matters
Could fundamentally change how we test and trust AI models, impacting safety research, model development, and regulatory standards.