Developer Tools

LLM-Assisted Empirical Software Engineering: Systematic Literature Review and Research Agenda

Systematic review finds LLMs widely used for automation in empirical software engineering, but rarely for decision-support and with persistent reproducibility gaps.

Deep Dive

A new systematic literature review by Victoria Gomes, Delaney Selb, Fabio Palomba, Rodrigo Spinola, and David Lo examines how Large Language Models (LLMs) are being used in Empirical Software Engineering (ESE). Analyzing 50 peer-reviewed papers from 12 top software engineering venues published between 2020 and 2025, the researchers mapped 69 distinct LLM-assisted tasks. These tasks are concentrated in mining software repositories and controlled experiments, with a focus on classification, filtering, and evaluation. The integration of LLMs is heavily automation-oriented, with limited use for decision-support or human-centered workflows.
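To make the automation-oriented usage concrete: a typical mining-software-repositories pipeline of the kind the review catalogs might use an LLM as a zero-shot classifier over commit messages. The sketch below is purely illustrative; the prompt wording, label set, and model name are assumptions, not details drawn from any of the surveyed studies.

```python
# Hypothetical sketch: LLM-as-classifier in a mining-software-repositories study.
# Model name, prompt wording, and label set are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["bug-fix", "feature", "refactoring", "docs", "other"]

def classify_commit(message: str) -> str:
    """Zero-shot classification of a commit message into one of LABELS."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; the surveyed studies vary
        temperature=0,        # reduces (but does not eliminate) inconsistency
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the commit message into exactly one of: "
                    + ", ".join(LABELS)
                    + ". Reply with the label only."
                ),
            },
            {"role": "user", "content": message},
        ],
    )
    label = response.choices[0].message.content.strip().lower()
    # Guard against hallucinated labels, one of the failure modes the review notes.
    return label if label in LABELS else "other"

if __name__ == "__main__":
    print(classify_commit("Fix null pointer dereference in parser cleanup path"))
```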

The review highlights significant benefits, including improved efficiency and scalability in data processing and analysis. However, it also identifies critical limitations such as hallucinations, inconsistency, prompt sensitivity, and reproducibility issues. The authors note that reporting practices for reproducibility are often incomplete. Based on these findings, the study proposes a research agenda to guide the responsible adoption of LLMs in ESE, emphasizing the need for greater transparency and human-centered integration to move beyond purely automation-driven applications.
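One practical response to the incomplete reproducibility reporting the authors flag is to log the full inference configuration alongside every output. The snippet below is a minimal sketch assuming a JSON-lines audit log and the field names shown; the review does not prescribe a specific format.

```python
# Minimal sketch of reproducibility logging for an LLM-assisted study.
# Field names and file format are assumptions, not a standard from the review.
import json
import time

def log_llm_call(path: str, *, model: str, temperature: float,
                 system_prompt: str, user_prompt: str, output: str) -> None:
    """Append one fully specified LLM call to a JSON-lines audit log."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "model": model,                  # exact model identifier/version string
        "temperature": temperature,      # decoding settings affect replicability
        "system_prompt": system_prompt,  # verbatim prompts enable exact re-runs
        "user_prompt": user_prompt,
        "output": output,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Logging prompts and decoding settings verbatim lets other researchers re-run a study against the same (or a newer) model and quantify drift, directly addressing the transparency gap the agenda calls out.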

Key Points
  • Identified 69 LLM-assisted tasks across 50 studies from 12 top software engineering venues (2020–2025).
  • LLMs are primarily used for automation (classification, filtering, evaluation) with limited decision-support roles.
  • Key limitations include hallucinations, inconsistency, prompt sensitivity, and incomplete reproducibility reporting.

Why It Matters

By cataloging where LLMs help and where they fail, this review gives researchers a concrete roadmap for adopting LLMs in empirical software engineering responsibly, prioritizing transparency and human oversight over automation alone.