A Survey of Code Review Benchmarks and Evaluation Practices in Pre-LLM and LLM Era
New research reveals a major blind spot in how we test AI code reviewers.
Deep Dive
A new survey of 99 research papers (2015-2025) exposes a critical lack of systematic benchmarks for evaluating AI-powered code review. Analyzing 58 pre-LLM and 41 LLM-era papers, the study found that datasets are scattered and that current evaluations provide limited insight. It documents a clear shift toward generative approaches to code review and growing multilingual coverage, yet existing evaluation practices fail to make clear which capabilities are actually being tested, hindering progress in automated code review.
Why It Matters
Without better benchmarks, we can't accurately measure or improve the AI tools developers rely on for code quality.