SmellBench reveals LLM agents struggle with architectural code smell repair

arXiv cs.SE May 11, 2026

⚡63.1% of detected smells were false positives, and the best agent resolved only 47.7% of the genuine ones.

Deep Dive

A new study by researchers from the University of Craiova and University Politehnica of Bucharest introduces SmellBench, the first framework designed to evaluate large language model agents on repairing architectural code smells, the cross-module design flaws that erode software maintainability. Unlike localized bugs, these smells can only be fixed with an understanding of the overall system architecture, which is a challenge for both developers and automated tools. The team tested 11 agent configurations from four major model families (GPT, Claude, Gemini, and Mistral) on 65 hard-severity architectural smells detected by PyExamine in the Python project scikit-learn, with all results validated against expert judgments.

The findings are sobering: 63.1% of the tool-flagged smells were false positives, and even the best-performing agent resolved only 47.7% of genuine smells. Agents identified false positives with near-expert agreement (kappa = 0.94), but their repair attempts often backfired, with the most aggressive agent introducing 140 new smells. This inverse relationship between repair aggressiveness and net codebase quality exposes a fundamental gap: current LLMs excel at localized code transformations but lack the architectural reasoning needed for safe cross-module refactoring. SmellBench provides reusable infrastructure for tracking progress on this critical but underexplored dimension of automated software engineering.
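For readers unfamiliar with the agreement statistic, the sketch below shows how a kappa value is typically computed, assuming the reported figure is Cohen's kappa between one agent's labels and the expert's labels (the summary does not say which variant was used). The function name `cohen_kappa`, the label strings, and the ten toy labels are illustrative assumptions, not data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability both pick the same label independently,
    # estimated from each rater's own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label for everything.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy data (NOT from the paper): ten flagged smells, each
# judged a false positive ("FP") or a genuine smell ("TP").
expert = ["FP"] * 6 + ["TP"] * 4
agent = ["FP"] * 5 + ["TP"] * 5  # disagrees with the expert on one item

print(round(cohen_kappa(expert, agent), 2))  # 0.8 for this toy data
```

On this toy data the raters agree on 9 of 10 items, yet kappa is 0.8 rather than 0.9 because half of that agreement would be expected by chance. A value of 0.94 therefore indicates agreement well beyond what the label frequencies alone would produce.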

Key Points

The best LLM agent (from the Claude family) resolved only 47.7% of true architectural code smells.
63.1% of tool-detected smells were false positives, but agents identified them with near-expert agreement (kappa = 0.94).
The most aggressive repair agent introduced 140 new smells, showing an inverse relationship between repair aggressiveness and net code quality.

Why It Matters

Highlights critical limitations of current LLMs for real-world software maintenance, guiding future research on architectural reasoning.
