From Restructuring to Stabilization: A Large-Scale Experiment on Iterative Code Readability Refactoring with Large Language Models
A study of 230 Java snippets shows that LLMs converge on a stable, "optimally readable" version of code after a few iterations.
A new research paper from Norman Peitek, Julia Hess, and Sven Apel provides a systematic, large-scale study on using Large Language Models (LLMs) for automated code refactoring, specifically to improve readability. The experiment used OpenAI's GPT-5.1 to iteratively refactor 230 Java code snippets across five rounds, testing three distinct prompting strategies. The researchers categorized the resulting changes into implementation-level, syntactic, and comment-level transformations, then tested the refactored code for functional correctness and probed robustness on additional snippets. This work moves beyond anecdotal evidence to offer a rigorous empirical analysis of how LLMs behave when tasked with improving software quality.
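The iterative setup described above can be pictured as a simple driver loop: feed a snippet to the model, feed the result back in, and stop after a fixed number of rounds or once the output stops changing. The sketch below is a minimal illustration, not the authors' pipeline; `llm_refactor` is a hypothetical stand-in for a real model call (here a trivial whitespace normalizer, so the loop converges deterministically and runs offline).

```python
def llm_refactor(code: str) -> str:
    """Hypothetical stand-in for one LLM refactoring round.

    In the study, each round sends the current snippet to the model with a
    readability prompt. Here we merely normalize whitespace so the example
    is self-contained and reaches a fixed point quickly.
    """
    lines = [line.rstrip() for line in code.splitlines()]
    return "\n".join(line for line in lines if line) + "\n"


def iterative_refactor(code: str, max_rounds: int = 5) -> list[str]:
    """Apply the refactorer repeatedly, recording every version.

    Stops early once a round produces no change (a fixed point), mirroring
    the stabilization phase reported in the paper.
    """
    versions = [code]
    for _ in range(max_rounds):
        nxt = llm_refactor(versions[-1])
        versions.append(nxt)
        if nxt == versions[-2]:  # no change this round: stabilized
            break
    return versions


snippet = "int add(int a,int b){   \n\n    return a+b;   \n}\n"
history = iterative_refactor(snippet)
print(len(history) - 1, "rounds until fixed point")
```

Recording the full version history, rather than only the final output, is what lets one distinguish a restructuring phase from a stabilization phase afterwards.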
The study's core finding is a clear, two-phase pattern: an initial phase of significant restructuring is followed by a stabilization phase in which changes become minimal. This convergence suggests that GPT-5.1 possesses an internalized model of what constitutes "optimally readable" code, and that this model is robust across different code variants. While explicit prompting for specific readability factors shifted the refactoring dynamics somewhat, the overall convergence trend remained strong. These insights matter for developers and engineering teams relying on AI coding assistants, as they indicate a degree of predictability in iterative refactoring tasks. The research opens pathways for future comparative studies across different models, such as Claude or Llama, and for evaluating other software quality dimensions, such as maintainability or performance, in AI-refactored code.
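To make the idea of "prompting strategies" concrete, the templates below show what a generic instruction, a factor-specific instruction, and an iterative framing might look like. These are hypothetical illustrations only; the study's actual prompts are in the paper and are not reproduced here.

```python
# Hypothetical prompt templates illustrating the kind of strategy
# variation the study compares (generic vs. explicit readability factors
# vs. iterative framing). Not the authors' actual prompts.
PROMPTS = {
    # Generic: no guidance on what "readable" means.
    "generic": "Improve the readability of the following Java code:\n{code}",
    # Factor-specific: names explicit readability factors (naming, structure).
    "factor_specific": (
        "Improve the readability of the following Java code. Focus on "
        "descriptive identifier names, consistent structure, and helpful "
        "comments:\n{code}"
    ),
    # Iterative framing: allows the model to declare the code stable.
    "iterative": (
        "Here is Java code that was refactored previously. Refine it "
        "further for readability, or return it unchanged if it is "
        "already optimal:\n{code}"
    ),
}

snippet = "int f(int x){return x*x;}"
for name, template in PROMPTS.items():
    print(name, "->", len(template.format(code=snippet)), "chars")
```

The interesting experimental question is whether such wording differences change where the model converges, or only how fast it gets there; the study's finding is that the convergence trend dominates.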
- The study used GPT-5.1 to refactor 230 Java snippets across 5 iterations, revealing a consistent 'restructuring then stabilization' pattern.
- Convergence to a stable, readable version was robust across code variants, indicating LLMs have an internalized readability model.
- Explicit prompting for specific readability factors (e.g., naming, structure) only slightly influenced the refactoring dynamics; the strong convergence trend held regardless.
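The "restructuring then stabilization" pattern in the takeaways above can be quantified by measuring how much each round changes the code, for instance with a text-similarity score between consecutive versions. A minimal sketch using Python's difflib, on illustrative data rather than the study's measurements:

```python
import difflib


def change_magnitude(before: str, after: str) -> float:
    """Fraction of the text that changed between two versions (0 = identical)."""
    return 1.0 - difflib.SequenceMatcher(None, before, after).ratio()


# Illustrative version history: heavy edits early, no edits later.
versions = [
    "int f(int x){if(x>0)return x;else return -x;}",
    "int abs(int value) { return value > 0 ? value : -value; }",
    "int abs(int value) {\n    return value > 0 ? value : -value;\n}",
    "int abs(int value) {\n    return value > 0 ? value : -value;\n}",
]

deltas = [change_magnitude(a, b) for a, b in zip(versions, versions[1:])]
for i, d in enumerate(deltas, start=1):
    print(f"round {i}: {d:.2f} changed")
```

Plotting such per-round deltas for many snippets is one way to surface the two-phase pattern: large values in early rounds, values near zero once the code stabilizes.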
Why It Matters
Provides empirical evidence that iterative AI code refactoring behaves predictably, converging on a stable, readable version of the code, which can strengthen developer trust in AI coding assistants.