Developer Tools

ClarEval: A Benchmark for Evaluating Clarification Skills of Code Agents under Ambiguous Instructions

New benchmark forces AI coding agents to ask questions, exposing a critical gap in real-world collaboration.

Deep Dive

A team of researchers has introduced ClarEval, a benchmark designed to move AI code agents from passive instruction followers to proactive collaborative partners. The core problem it targets is that current evaluations reward agents for 'guessing' user intent under ideal conditions, neglecting the ability to align through dialogue, a necessity for real-world software engineering. ClarEval simulates the inherent ambiguity of human communication by systematically injecting three types of realistic ambiguity into standard coding tasks: missing goals, missing premises, and ambiguous terminology. This forces agents to step out of a pure 'generator' role and engage in requirement elicitation, a capability the benchmark quantifies with a new metric suite.
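
To make the injection concrete, here is a minimal sketch of what degrading a task along those three axes could look like. The task schema, field names, and function names are illustrative assumptions, not ClarEval's actual interface.

```python
# Hypothetical sketch of ClarEval-style ambiguity injection. The schema and
# names below are assumptions for illustration, not the benchmark's real API.
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class CodingTask:
    goal: str                  # what the finished code should accomplish
    premises: tuple[str, ...]  # environment facts, e.g. "input list is sorted"
    instruction: str           # the prompt text actually shown to the agent

def inject_ambiguity(task: CodingTask, kind: str) -> CodingTask:
    """Return a degraded copy of the task that omits or blurs information,
    so a well-calibrated agent must ask a clarifying question before coding."""
    if kind == "missing_goal":
        # Drop the explicit goal; the remaining instruction underspecifies intent.
        return replace(task, goal="")
    if kind == "missing_premise":
        # Remove one premise the correct solution silently depends on.
        dropped = random.choice(task.premises)
        return replace(task, premises=tuple(p for p in task.premises if p != dropped))
    if kind == "ambiguous_terminology":
        # Blur a precise term, e.g. "stable merge sort" becomes just "sort".
        return replace(task, instruction=task.instruction.replace("stable merge sort", "sort"))
    raise ValueError(f"unknown ambiguity kind: {kind!r}")
```

An agent that codes straight through a "missing_premise" task can still pass a lenient test suite by luck; the point of the injection is that only a clarifying question makes success reliable.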

The benchmark's key metrics, Average Turns to Clarify (ATC) and Key Question Coverage (KQC), measure not just final code correctness but the efficiency and precision of the collaborative process itself: how quickly an agent resolves ambiguity, and whether its questions target the information that actually matters. In experiments across eleven state-of-the-art agents, including GPT-5-Coder, ClarEval revealed a stark performance gap: while these agents excel at generating code from clear prompts, they often lack the strategic communication skills needed for efficient partnership. The findings give AI developers a concrete roadmap, highlighting that building a strong coder is not the same as building a capable collaborator. The publicly available framework pushes the field toward AI that can truly integrate into human workflows by asking the right questions before writing a single line of code.
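
As a rough illustration, the two metrics could be computed from dialogue transcripts along these lines. The turn-counting convention and the per-task gold annotations of injected ambiguities are assumptions here; ClarEval's exact definitions may differ.

```python
# Hypothetical sketch of ATC and KQC computation over dialogue transcripts.
# The Dialogue fields are illustrative assumptions, not ClarEval's data format.
from dataclasses import dataclass

@dataclass
class Dialogue:
    clarification_turns: int  # agent turns spent asking questions
    asked_about: set[str]     # ambiguity IDs the agent's questions touched
    key_questions: set[str]   # gold-annotated ambiguity IDs for the task

def average_turns_to_clarify(dialogues: list[Dialogue]) -> float:
    """ATC: mean number of turns spent clarifying before code is written.
    Lower is better, rewarding efficient, targeted questioning."""
    return sum(d.clarification_turns for d in dialogues) / len(dialogues)

def key_question_coverage(dialogues: list[Dialogue]) -> float:
    """KQC: average fraction of gold key questions the agent actually asked.
    Higher is better, rewarding precision in locating the real ambiguities."""
    per_task = [len(d.asked_about & d.key_questions) / len(d.key_questions)
                for d in dialogues if d.key_questions]
    return sum(per_task) / len(per_task)
```

Read together, the two numbers separate chattiness from insight: an agent can score a low ATC by asking nothing, but only an agent whose questions hit the annotated ambiguities earns a high KQC.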

Key Points
  • Introduces a new benchmark (ClarEval) that injects three types of ambiguity—missing goals, premises, and terminology—into coding tasks.
  • Proposes new metrics, Average Turns to Clarify (ATC) and Key Question Coverage (KQC), that measure an AI's 'Collaborative Quotient,' not just its final output.
  • Evaluation of 11 top agents, including GPT-5-Coder, shows they excel at coding but lack strategic communication for real-world collaboration.

Why It Matters

It exposes a critical weakness in current AI assistants and provides a roadmap for building true collaborative partners for developers.