CodeTaste: Can LLMs Generate Human-Level Code Refactorings?
New benchmark shows AI coding agents often miss the architectural improvements human developers would make.
Researchers from ETH Zurich have introduced CodeTaste, a novel benchmark designed to evaluate whether large language model coding agents can execute human-level code refactoring—the process of improving code structure without changing its behavior. The benchmark was constructed by mining real-world, multi-file refactoring tasks from open-source repositories, creating a challenging testbed that goes beyond simple code generation. The study reveals a significant gap: while models like GPT-4 and Claude 3 can reliably implement refactorings when given explicit instructions, they frequently fail, when presented only with a general area that needs work, to discover on their own the specific architectural improvements that human developers chose in practice.
The research team developed a scoring system that combines repository test suites with custom dataflow-based static analysis, verifying both that bad patterns were removed and that good ones were introduced. Their experiments showed that a two-stage 'propose-then-implement' decomposition, in which the agent first suggests multiple refactoring options and then executes the best-aligned one, significantly improves performance. This finding suggests that current LLM agents lack the nuanced architectural 'taste' required for proactive code quality improvement, highlighting a key area for future development in making AI coding assistants truly collaborative partners in software engineering.
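The article does not publish the benchmark's scoring code, but the described check — behavior preserved by the repository's tests, bad pattern gone, good pattern introduced — can be sketched as follows. Here `score_refactoring`, the pattern strings, and the stubbed test runner are all hypothetical names for illustration, and regex matching stands in for the paper's far richer dataflow analysis:

```python
import re
from typing import Callable

def score_refactoring(
    patched_source: str,
    tests_pass: Callable[[str], bool],
    bad_pattern: str,
    good_pattern: str,
) -> bool:
    """Accept a refactoring only if behavior is preserved (tests pass),
    the bad pattern is gone, and the good pattern was introduced.
    Regex matching is a simplified stand-in for dataflow-based analysis."""
    if not tests_pass(patched_source):
        return False  # refactoring changed behavior
    if re.search(bad_pattern, patched_source):
        return False  # the bad pattern is still present
    return re.search(good_pattern, patched_source) is not None

# Toy before/after: a magic number extracted into a named constant.
before = "def total(x):\n    return x * 1.08  # magic tax rate\n"
after = "TAX_RATE = 1.08\n\ndef total(x):\n    return x * TAX_RATE\n"

always_pass = lambda src: True  # stub for the repository test suite
print(score_refactoring(after, always_pass, r"#\s*magic", r"TAX_RATE"))   # True
print(score_refactoring(before, always_pass, r"#\s*magic", r"TAX_RATE"))  # False
```

Combining dynamic (tests) and static (pattern) checks is what lets the benchmark score refactorings that tests alone cannot distinguish, since a no-op patch also keeps the tests green.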
- CodeTaste benchmark tests LLMs on real-world refactoring tasks mined from open-source repositories
- Frontier models succeed on more than 70% of tasks when given detailed instructions, but struggle to identify human-chosen improvements independently
- A propose-then-implement strategy boosts alignment, showing promise for future agent architectures
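The propose-then-implement strategy above can be sketched as a small control loop: one stage generates candidate refactorings, a second stage implements the highest-scoring one. The `Proposal` dataclass, the stage callables, and the stub functions below are illustrative assumptions, not the paper's actual agent interface:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Proposal:
    description: str
    score: float  # alignment score from a hypothetical judge/critic stage

def propose_then_implement(
    propose: Callable[[str], List[Proposal]],
    implement: Callable[[str], str],
    task_area: str,
) -> str:
    """Two-stage decomposition: enumerate candidate refactorings for a
    general task area, then implement only the best-aligned candidate."""
    candidates = propose(task_area)
    if not candidates:
        raise ValueError("no refactoring candidates proposed")
    best = max(candidates, key=lambda p: p.score)
    return implement(best.description)

# Stubs standing in for the LLM-backed propose and implement stages:
def fake_propose(area: str) -> List[Proposal]:
    return [
        Proposal("inline the duplicated helper", 0.4),
        Proposal("extract shared validation into a module", 0.9),
    ]

def fake_implement(description: str) -> str:
    return f"patch implementing: {description}"

print(propose_then_implement(fake_propose, fake_implement, "auth module"))
# → patch implementing: extract shared validation into a module
```

Separating proposal from implementation lets the agent commit to one concrete plan before editing files, which is plausibly why the article reports it improves alignment with the refactorings humans actually chose.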
Why It Matters
Reveals a critical gap in AI's ability to improve code architecture, impacting the future of automated software maintenance and technical debt reduction.