A Survey of Code Review Benchmarks and Evaluation Practices in Pre-LLM and LLM Era
New research reveals a major blind spot in how we test AI code reviewers.
Deep Dive
A new survey of 99 research papers (2015-2025) exposes a critical lack of systematic benchmarks for evaluating AI-powered code review. Analyzing 58 pre-LLM and 41 LLM-era papers, the study found that datasets are scattered and that current evaluations provide limited insight. It documents a clear shift toward generative approaches to code review and growing multilingual coverage, yet existing evaluation practices fail to make clear which capabilities are actually being tested, hindering progress in automated code review.
Why It Matters
Without better benchmarks, we can't accurately measure or improve the AI tools developers rely on for code quality.