REFLECT benchmark shows LLM judges are wrong more than 45% of the time when evaluating AI research agents
Top LLM judges score below 55% accuracy and are especially poor at evidence verification.
Deep Dive
Researchers introduced REFLECT, a meta-evaluation benchmark that tests the LLM judges used to evaluate deep research agents. REFLECT defines a taxonomy of process- and outcome-level failures and injects them into real agent traces through controlled interventions, so each test case has a known failure for the judge to find. The best models achieve less than 55% overall accuracy at detecting reasoning, tool-use, and report-quality failures, with especially weak performance on evidence verification.
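The intervention-then-judge setup can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not REFLECT's actual code or data format: the `Case` structure, the `inject_missing_citation` intervention, the `" [source]"` marker, the `missing_citation` label, and the toy `naive_judge` that stands in for an LLM judge are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Case:
    """One test case: a trace plus the failure injected into it (hypothetical format)."""
    trace: tuple[str, ...]
    injected_failure: Optional[str]  # None means the trace was left clean


def inject_missing_citation(trace: tuple[str, ...]) -> tuple[str, ...]:
    """Hypothetical intervention: strip the source marker from every step."""
    return tuple(step.replace(" [source]", "") for step in trace)


def accuracy_by_failure(
    cases: list[Case],
    judge: Callable[[tuple[str, ...]], Optional[str]],
) -> dict[str, float]:
    """Score a judge per failure label; a verdict is correct only if it matches the label."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for case in cases:
        label = case.injected_failure or "clean"
        totals[label] += 1
        if judge(case.trace) == case.injected_failure:
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


def naive_judge(trace: tuple[str, ...]) -> Optional[str]:
    """Toy stand-in for an LLM judge: it never flags a failure."""
    return None


clean_trace = (
    "search: protein folding benchmarks",
    "report: the method improved accuracy [source]",
)
cases = [
    Case(clean_trace, None),
    Case(inject_missing_citation(clean_trace), "missing_citation"),
]
scores = accuracy_by_failure(cases, naive_judge)
print(scores)
```

Because the failure is injected deliberately, the correct verdict for each case is known in advance, which is what makes per-category accuracy measurable. A judge that approves everything scores perfectly on clean traces and zero on the injected ones, so reporting accuracy per failure type exposes weaknesses that a single overall number would hide.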
Key Points
- REFLECT introduces a taxonomy of 14 process- and outcome-level failure modes for deep research agents, applied via controlled interventions on real execution traces.
- Best-performing LLM judges (GPT-4, Claude 3.5) achieve <55% accuracy on detecting reasoning, tool-use, and report-quality failures.
- Evidence verification is the weakest area: judges routinely miss hallucinated citations and missing source references.
Why It Matters
Unreliable LLM judges could greenlight flawed AI research agents, undermining trust in automated evidence synthesis at scale.