Massive Survey Analyzes 99 Papers on AI Code Review Benchmarks
New research reveals a major blind spot in how we test AI code reviewers.
A new survey of 99 research papers published between 2015 and 2025 exposes a lack of systematic benchmarks for evaluating AI-powered code review. The study, which analyzes 58 pre-LLM and 41 LLM-era papers, finds that datasets are scattered and that existing evaluations offer limited insight. It documents a clear shift toward generative peer review and growing multilingual coverage, but current evaluations do not make clear which capabilities are actually being tested, which hinders progress in automated code quality.
Why It Matters
Without better benchmarks, we can't accurately measure or improve the AI tools developers rely on for code quality.