EvoTrace finds that much of AI coding 'evolution' is code recycling
About 30% of added lines are byte-identical re-introductions of previously deleted code.
A new study from researchers led by Nico Pelleriti asks a basic question about LLM-driven evolutionary search: when LLMs generate code and a search loop selects it, what is actually being evolved? Prior work has shown impressive results in mathematical discovery and algorithm design, but progress is typically measured only by final benchmark scores. The team introduces EvoTrace, a dataset spanning four evolutionary coding frameworks, reasoning and non-reasoning models, and 16 tasks. They also develop EvoReplay, a replay-based methodology that reconstructs local search states and tests controlled interventions (such as tweaking constants, removing code components, or swapping models) to understand how score gains are achieved.
The paper's most striking finding is that about 30% of code lines added during search are byte-identical re-introductions of lines that were previously deleted, a deterministic cycling pattern present in nearly every run. In addition, most score improvements come from a small subset of the nine annotated edit types. This suggests that many reported performance gains may not represent true algorithmic discovery, but rather re-tuning of existing strategies, recombination of knowledge the model already has, or overfitting to the evaluator. EvoTrace and EvoReplay offer a more diagnostic way to evaluate evolutionary coding agents than final benchmark scores alone.
- EvoTrace dataset covers 4 frameworks, 16 tasks, and 9 annotated edit types
- ~30% of added code lines are byte-identical re-additions of previously deleted lines
- Most benchmark gains come from a small subset of edit types, not novel algorithmic structure
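To make the re-addition figure concrete, here is a minimal sketch of how such a rate could be measured over a sequence of program versions. It is an illustration under stated assumptions, not the paper's actual pipeline: the function name `readdition_rate`, the use of Python's `difflib` for line diffs, and the choice to ignore blank lines are all hypothetical. A line counts as "re-added" when it is added in some step and an exactly identical line was deleted in an earlier step.

```python
import difflib


def readdition_rate(versions):
    """Count added lines that exactly match a line deleted in an earlier step.

    `versions` is a list of source-code strings in search order.
    Returns (re_added, total_added). Hypothetical helper, not from the paper.
    """
    deleted_pool = set()  # non-blank lines deleted in any earlier step
    re_added = 0
    total_added = 0
    for old, new in zip(versions, versions[1:]):
        # splitlines() drops line terminators, so lines are compared by content.
        old_lines = old.splitlines()
        new_lines = new.splitlines()
        added, deleted = [], []
        matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ("replace", "delete"):
                deleted.extend(old_lines[i1:i2])
            if tag in ("replace", "insert"):
                added.extend(new_lines[j1:j2])
        for line in added:
            if not line.strip():
                continue  # assumption: blank lines are not counted
            total_added += 1
            if line in deleted_pool:
                re_added += 1
        # Deletions from this step only become eligible in later steps.
        deleted_pool.update(line for line in deleted if line.strip())
    return re_added, total_added


# Toy history: "y = 2" is deleted in step 1 and comes back unchanged in step 2.
toy_versions = [
    "x = 1\ny = 2\n",
    "x = 1\nz = 3\n",
    "x = 1\ny = 2\n",
    "x = 1\ny = 2\nw = 4\n",
]
re_added, total_added = readdition_rate(toy_versions)
print(f"{re_added} of {total_added} added lines are re-additions")
```

In the toy history, one of the three added lines is a re-addition. A real analysis would also have to decide how to treat whitespace-only changes, moved lines, and branching search histories, none of which this sketch handles.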
Why It Matters
For professionals relying on AI-generated code, the findings caution against equating benchmark scores with genuine algorithmic innovation.