Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering
A new paper analyzes 18 studies to tackle the 'black box' problem in AI software engineering.
Researchers Jingyue Li and André Storhaug have published a position paper titled 'Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering.' The work, accepted to the ResponsibleSE 2026 workshop, critically analyzes 18 recent papers from top conferences such as ICSE and FSE. Their core finding is that current evaluations of AI agents (autonomous systems built on LLMs such as GPT-4 or Claude) for coding tasks are often flawed: the underlying LLMs act as 'black boxes,' making it impossible to justify why one agent outperforms another, while missing details in research papers frequently render results irreproducible.
To address these critical shortcomings, the authors propose a concrete set of guidelines to standardize future research. Their key recommendation is for researchers to publicly share the detailed 'Thought-Action-Result' (TAR) trajectories and the raw LLM interaction data (or summarized versions) generated by their AI agents during evaluations. This data would provide a transparent, step-by-step record of the agent's reasoning, actions, and outcomes. The paper includes a proof-of-concept case study demonstrating how analyzing these TAR trajectories enables systematic, apples-to-apples comparisons between different Agentic AI approaches, revealing their specific strengths and weaknesses beyond a simple pass/fail score.
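To make the idea concrete, below is a minimal sketch of what a shared TAR trajectory might look like as structured data. The field names, the record layout, and the simple "failed actions" comparison are illustrative assumptions, not a schema prescribed by the paper; they only show the kind of step-by-step record that would let reviewers look past a single pass/fail score.

```python
# Hypothetical sketch of a Thought-Action-Result (TAR) trajectory record.
# Field names and the comparison metric are assumptions for illustration,
# not the format proposed by Li and Storhaug.
from dataclasses import dataclass
from typing import List


@dataclass
class TARStep:
    thought: str   # the agent's stated reasoning before acting
    action: str    # e.g. "edit_file", "run_tests", "search_repo"
    result: str    # observed outcome, e.g. test output or an error message


@dataclass
class Trajectory:
    agent: str           # which agentic AI produced this run
    task_id: str         # e.g. an issue identifier from the benchmark
    steps: List[TARStep]
    resolved: bool       # final pass/fail outcome


def compare_trajectories(a: Trajectory, b: Trajectory) -> None:
    """Contrast how two agents reached their outcomes, not just whether they passed."""
    print(f"Task {a.task_id}: {a.agent} resolved={a.resolved} in {len(a.steps)} steps, "
          f"{b.agent} resolved={b.resolved} in {len(b.steps)} steps")
    # Count wasted effort: actions whose recorded results report an error.
    for t in (a, b):
        failures = sum(1 for s in t.steps if "error" in s.result.lower())
        print(f"  {t.agent}: {failures} failed actions out of {len(t.steps)}")
```

Published records in this spirit would let others replay an agent's reasoning step by step and quantify, for example, how much of its work was spent recovering from failed actions before a task was resolved.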
- Analyzed 18 papers from top SE conferences (ICSE, FSE, ASE) and found evaluations lack reproducibility and explainability.
- Proposes new guidelines recommending public sharing of Thought-Action-Result (TAR) trajectories and LLM interaction data.
- Includes a proof-of-concept case study showing how TAR data enables systematic comparison of different AI agent approaches.
Why It Matters
This pushes for transparency in AI coding tools, giving developers a sounder basis for trusting and comparing agents such as GitHub Copilot, Cursor, and Devin.