CodeTaste: Can LLMs Generate Human-Level Code Refactorings?
New benchmark shows AI coding agents often miss the architectural improvements human developers would make.
Researchers from ETH Zurich have introduced CodeTaste, a novel benchmark designed to evaluate whether large language model coding agents can execute human-level code refactoring—the process of improving code structure without changing its behavior. The benchmark was constructed by mining real-world, multi-file refactoring tasks from open-source repositories, creating a challenging testbed that goes beyond simple code generation. The study reveals a significant gap: while models like GPT-4 and Claude 3 can reliably implement refactorings when given explicit instructions, they frequently fail, when presented only with a general area that needs work, to discover on their own the specific architectural improvements that human developers chose in practice.
The research team developed a scoring system that combines repository test suites with custom dataflow-based static analysis, verifying both that bad patterns were removed and that good ones were introduced. Their experiments showed that a two-stage 'propose-then-implement' decomposition, in which the agent first suggests multiple refactoring options and then executes the best-aligned one, significantly improves performance. This finding suggests that current LLM agents lack the nuanced architectural 'taste' required for proactive code quality improvement, highlighting a key area for future development in making AI coding assistants truly collaborative partners in software engineering.
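The article does not publish the benchmark's scoring code, but the described check — behavior preserved by the repository's tests, bad pattern gone, good pattern introduced — can be sketched as follows. Here `score_refactoring`, the pattern strings, and the stubbed test runner are all hypothetical names for illustration, and regex matching stands in for the paper's far richer dataflow analysis:

```python
import re
from typing import Callable

def score_refactoring(
    patched_source: str,
    tests_pass: Callable[[str], bool],
    bad_pattern: str,
    good_pattern: str,
) -> bool:
    """Accept a refactoring only if behavior is preserved (tests pass),
    the bad pattern is gone, and the good pattern was introduced.
    Regex matching is a simplified stand-in for dataflow-based analysis."""
    if not tests_pass(patched_source):
        return False  # refactoring changed behavior
    if re.search(bad_pattern, patched_source):
        return False  # the bad pattern is still present
    return re.search(good_pattern, patched_source) is not None

# Toy before/after: a magic number extracted into a named constant.
before = "def total(x):\n    return x * 1.08  # magic tax rate\n"
after = "TAX_RATE = 1.08\n\ndef total(x):\n    return x * TAX_RATE\n"

always_pass = lambda src: True  # stub for the repository test suite
print(score_refactoring(after, always_pass, r"#\s*magic", r"TAX_RATE"))   # True
print(score_refactoring(before, always_pass, r"#\s*magic", r"TAX_RATE"))  # False
```

Combining dynamic (tests) and static (pattern) checks is what lets the benchmark score refactorings that tests alone cannot distinguish, since a no-op patch also keeps the tests green.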
- CodeTaste benchmark tests LLMs on real-world refactoring tasks mined from open-source repositories
- Frontier models succeed on more than 70% of tasks when given detailed instructions, but struggle to identify human-chosen improvements independently
- A propose-then-implement strategy boosts alignment, showing promise for future agent architectures
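The propose-then-implement strategy above can be sketched as a small control loop: one stage generates candidate refactorings, a second stage implements the highest-scoring one. The `Proposal` dataclass, the stage callables, and the stub functions below are illustrative assumptions, not the paper's actual agent interface:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Proposal:
    description: str
    score: float  # alignment score from a hypothetical judge/critic stage

def propose_then_implement(
    propose: Callable[[str], List[Proposal]],
    implement: Callable[[str], str],
    task_area: str,
) -> str:
    """Two-stage decomposition: enumerate candidate refactorings for a
    general task area, then implement only the best-aligned candidate."""
    candidates = propose(task_area)
    if not candidates:
        raise ValueError("no refactoring candidates proposed")
    best = max(candidates, key=lambda p: p.score)
    return implement(best.description)

# Stubs standing in for the LLM-backed propose and implement stages:
def fake_propose(area: str) -> List[Proposal]:
    return [
        Proposal("inline the duplicated helper", 0.4),
        Proposal("extract shared validation into a module", 0.9),
    ]

def fake_implement(description: str) -> str:
    return f"patch implementing: {description}"

print(propose_then_implement(fake_propose, fake_implement, "auth module"))
# → patch implementing: extract shared validation into a module
```

Separating proposal from implementation lets the agent commit to one concrete plan before editing files, which is plausibly why the article reports it improves alignment with the refactorings humans actually chose.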
Why It Matters
Reveals a critical gap in AI's ability to improve code architecture, impacting the future of automated software maintenance and technical debt reduction.