Research & Papers

Position: Science of AI Evaluation Requires Item-level Benchmark Data

A new paper argues current AI evaluations are flawed and require granular, item-level data for validity.

Deep Dive

A team of five researchers, including Han Jiang, has published a position paper arguing that the current paradigm for evaluating AI models is fundamentally flawed. They contend that aggregate benchmark scores often mask systemic validity failures stemming from unjustified design choices and misaligned metrics. The paper asserts that without granular, item-level data, meaning results recorded for each individual test question or prompt, it is impossible to conduct proper diagnostic analysis or gather meaningful evidence for a benchmark's validity. This lack of rigor is a critical issue as AI systems are increasingly deployed in high-stakes domains on the basis of these evaluations.
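
To make the masking effect concrete, here is a minimal sketch with invented data (not drawn from the paper or any real benchmark): two models with identical aggregate accuracy can disagree on every single item, a pattern that top-line scores cannot reveal but per-item results expose immediately.

```python
import numpy as np

# Hypothetical item-level results (1 = correct, 0 = incorrect) on ten items
# for two models. Illustrative data only.
model_a = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
model_b = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Aggregate scores are identical...
print(model_a.mean(), model_b.mean())        # 0.5 0.5

# ...yet the models agree on zero items, so they fail in entirely different
# ways. Only item-level data makes this visible.
print((model_a == model_b).mean())           # 0.0
```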

To address this, the researchers advocate for a new science of AI evaluation modeled on principles from psychometrics and computer science. They demonstrate how analyzing item properties and latent constructs can reveal unique insights into model capabilities and failures. To catalyze community adoption of this approach, they have introduced OpenEval, a public repository designed to host and share item-level benchmark data. This initiative aims to move the field beyond opaque, top-line scores toward transparent, evidence-centered evaluation that can truly diagnose why a model succeeds or fails.
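
As an illustration of the kind of item-level diagnostics the authors have in mind, the sketch below computes two classical psychometric quantities, item difficulty and point-biserial discrimination, from a per-item correctness matrix. The data, shapes, and helper function here are assumptions made for the example, not artifacts from OpenEval or the paper.

```python
import numpy as np

# Hypothetical response matrix: rows = models, columns = benchmark items,
# entries 1/0 for correct/incorrect. Invented for illustration; a repository
# of item-level results would hold real data of this shape.
rng = np.random.default_rng(0)
responses = (rng.random((50, 40)) < np.linspace(0.2, 0.9, 40)).astype(int)

total_scores = responses.sum(axis=1)     # each model's aggregate score
difficulty = responses.mean(axis=0)      # classical item difficulty (proportion correct)

def point_biserial(item, totals):
    # Correlation between answering one item correctly and overall performance,
    # with the item's own contribution removed from the total. Near-zero or
    # negative values flag items that may not measure the intended construct.
    rest = totals - item
    return np.corrcoef(item, rest)[0, 1]

discrimination = np.array([point_biserial(responses[:, j], total_scores)
                           for j in range(responses.shape[1])])

print(difficulty[:5])        # per-item difficulty, invisible in an aggregate score
print(discrimination[:5])    # per-item discrimination, a simple validity signal
```

None of these quantities can be recovered from an aggregate accuracy number alone, which is the practical point of publishing item-level data.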

Key Points
  • Current AI benchmark evaluations exhibit systemic validity failures that aggregate scores hide.
  • The paper proposes a shift to item-level analysis for fine-grained model diagnostics and benchmark validation.
  • Researchers introduced OpenEval, a new public repository for sharing item-level benchmark data.

Why It Matters

More rigorous evaluations are needed to deploy generative AI safely in high-stakes areas such as healthcare and finance.