Developer Tools

Study: Developers spend more time reviewing LLM-labeled code but not more carefully

Eye-tracking reveals developers linger on AI code but don't check it more thoroughly.

Deep Dive

A new study published on arXiv and accepted at ASE 2026 investigates how developers behave when reviewing code explicitly labeled as generated by large language models (LLMs). Researchers from Chalmers University of Technology and the University of Gothenburg ran a Wizard-of-Oz experiment in which participants reviewed code with and without an 'LLM-generated' label while their eye movements were tracked. The study collected quantitative eye-tracking data and qualitative feedback from exit interviews.

The key finding: developers spent more time fixating on LLM-labeled code, but their scrutiny, measured by the thoroughness of their review, did not increase. The label drew more visual attention without leading to more careful checking. Practitioners also adapted their review strategies, assessing code against specific criteria such as logical correctness and using the original prompt as a guide. This suggests that while developers may trust LLM-labeled code less, they do not necessarily compensate with deeper analysis. The authors recommend that software companies revisit their AI-assisted development policies and design tools that integrate prompts as artifacts to support more effective review of LLM-generated code.

Key Points
  • Eye-tracking data shows developers spend more time (measured by fixations) on LLM-labeled code, but review thoroughness remains unchanged
  • Developers adapt by evaluating LLM-labeled code against specific criteria such as logical correctness and by using prompts to guide their review
  • The study reveals a gap between developers' intentions and their actual behavior, and suggests companies need better AI policies for code review

Why It Matters

LLM-generated code is increasingly common in development workflows, and this study suggests that labeling it as such makes developers look at it longer without making them review it more carefully.
