Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering
A new paper analyzes 18 studies to tackle the 'black box' problem in AI software engineering.
Researchers Jingyue Li and André Storhaug have published a position paper titled 'Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering.' The work, accepted to the ResponsibleSE 2026 workshop, critically analyzes 18 recent papers from top conferences such as ICSE and FSE. Their core finding is that current evaluations of AI agents (autonomous systems built on LLMs such as GPT-4 or Claude) for coding tasks are often flawed: the underlying LLMs act as 'black boxes,' making it impossible to justify why one agent outperforms another, while missing details in research papers frequently render results irreproducible.
To address these critical shortcomings, the authors propose a concrete set of guidelines to standardize future research. Their key recommendation is for researchers to publicly share the detailed 'Thought-Action-Result' (TAR) trajectories and the raw LLM interaction data (or summarized versions) generated by their AI agents during evaluations. This data would provide a transparent, step-by-step record of the agent's reasoning, actions, and outcomes. The paper includes a proof-of-concept case study demonstrating how analyzing these TAR trajectories enables systematic, apples-to-apples comparisons between different Agentic AI approaches, revealing their specific strengths and weaknesses beyond a simple pass/fail score.
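To make the idea concrete, below is a minimal sketch of what a shared TAR trajectory might look like as structured data. The field names, the record layout, and the simple "failed actions" comparison are illustrative assumptions, not a schema prescribed by the paper; they only show the kind of step-by-step record that would let reviewers look past a single pass/fail score.

```python
# Hypothetical sketch of a Thought-Action-Result (TAR) trajectory record.
# Field names and the comparison metric are assumptions for illustration,
# not the format proposed by Li and Storhaug.
from dataclasses import dataclass
from typing import List


@dataclass
class TARStep:
    thought: str   # the agent's stated reasoning before acting
    action: str    # e.g. "edit_file", "run_tests", "search_repo"
    result: str    # observed outcome, e.g. test output or an error message


@dataclass
class Trajectory:
    agent: str           # which agentic AI produced this run
    task_id: str         # e.g. an issue identifier from the benchmark
    steps: List[TARStep]
    resolved: bool       # final pass/fail outcome


def compare_trajectories(a: Trajectory, b: Trajectory) -> None:
    """Contrast how two agents reached their outcomes, not just whether they passed."""
    print(f"Task {a.task_id}: {a.agent} resolved={a.resolved} in {len(a.steps)} steps, "
          f"{b.agent} resolved={b.resolved} in {len(b.steps)} steps")
    # Count wasted effort: actions whose recorded results report an error.
    for t in (a, b):
        failures = sum(1 for s in t.steps if "error" in s.result.lower())
        print(f"  {t.agent}: {failures} failed actions out of {len(t.steps)}")
```

Published records in this spirit would let others replay an agent's reasoning step by step and quantify, for example, how much of its work was spent recovering from failed actions before a task was resolved.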
- Analyzed 18 papers from top SE conferences (ICSE, FSE, ASE) and found evaluations lack reproducibility and explainability.
- Proposes new guidelines recommending public sharing of Thought-Action-Result (TAR) trajectories and LLM interaction data.
- Includes a proof-of-concept case study showing how TAR data enables systematic comparison of different AI agent approaches.
Why It Matters
This pushes for transparency in AI coding tools, giving developers a sounder basis for trusting and comparing agents such as GitHub Copilot, Cursor, and Devin.