Developer Tools

Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs

Study of 12 LLMs finds no consistent advantage over basic classifiers for systematic review screening.

Deep Dive

A new study from researchers Gilberto Sussumu Hida, Danilo Monteiro Ribeiro, and Erika Yahata systematically evaluates the performance and variability of large language models (LLMs) for study screening in software engineering systematic literature reviews (SLRs). The authors tested 12 LLMs from four major providers (OpenAI, Google Gemini, Anthropic, and Meta's Llama) alongside four classical machine learning models (Logistic Regression, Support Vector Classification, Random Forest, and Naive Bayes) on two real SLRs totaling 518 papers. They investigated three critical dimensions: LLM performance variability across runs and architectures, the impact of input metadata (abstract, title, keywords) on LLM accuracy, and whether LLMs offer a meaningful advantage over traditional classifiers under a shared protocol.

The results reveal substantial heterogeneity among LLMs and persistent non-determinism even when temperature was set to zero. Abstract availability proved decisive: removing it consistently degraded performance, while adding title and/or keywords to the abstract yielded no robust gains. Most importantly, performance differences between the LLMs and the classical classifiers were not consistent enough to support any generalizable claim of LLM superiority. The authors caution that LLM adoption for evidence screening should be justified by operational and governance constraints, such as reproducibility, cost, and metadata availability, and must be supported by pilot validation and explicit reporting of variability and input configuration. This paper serves as a timely reminder that more complex models don't automatically deliver better results for systematic review tasks.
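The kind of variability reporting the authors call for can be done with very little machinery. The sketch below (illustrative only, not code from the paper; the data and the `decision_stability` helper are invented for the example) computes the fraction of papers on which repeated screening runs all agree:

```python
# Illustrative sketch: quantifying run-to-run variability of screening
# decisions. `runs` holds mock include/exclude decisions (1 = include,
# 0 = exclude) from three repeated LLM passes over the same five papers.
runs = [
    [1, 0, 1, 1, 0],  # run 1
    [1, 0, 0, 1, 0],  # run 2
    [1, 1, 1, 1, 0],  # run 3
]

def decision_stability(runs):
    """Fraction of papers on which every run made the same decision."""
    per_paper = list(zip(*runs))  # group decisions by paper
    stable = sum(1 for decisions in per_paper if len(set(decisions)) == 1)
    return stable / len(per_paper)

print(f"stable decisions: {decision_stability(runs):.0%}")  # → 60%
```

Reporting a number like this alongside accuracy makes the non-determinism the study observed (even at temperature zero) visible to readers of an SLR.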

Key Points
  • 12 LLMs from OpenAI, Google Gemini, Anthropic, and Meta Llama tested on 2 SLRs with 518 papers
  • LLMs exhibited high variability and non-determinism even at temperature zero; abstract removal degraded performance significantly
  • Classical models like Logistic Regression and Random Forest matched or exceeded LLMs, with no consistent advantage for modern LLMs
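For context, the classical baselines in the comparison are a few lines of scikit-learn. This is a minimal sketch of one such screener (TF-IDF features plus Logistic Regression); the toy abstracts and labels are invented for illustration and do not come from the study's datasets:

```python
# Illustrative classical screening baseline: TF-IDF over abstracts
# fed into Logistic Regression. Labels: 1 = include, 0 = exclude.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for SLR abstracts (not the study's actual data)
abstracts = [
    "We evaluate test case prioritization techniques in continuous integration.",
    "A survey of agile practices in small software teams.",
    "Deep learning for defect prediction in source code repositories.",
    "Marketing strategies for consumer electronics retailers.",
    "Mutation testing effectiveness for regression test suites.",
    "Supply chain optimization in food logistics.",
]
labels = [1, 1, 1, 0, 1, 0]

screener = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
screener.fit(abstracts, labels)

# Unlike an LLM API call, repeated predictions here are deterministic.
candidate = ["Search-based test generation for object-oriented software."]
print(screener.predict(candidate))
```

A model this simple is cheap, reproducible, and auditable, which is exactly the operational trade-off the authors say should drive the choice of screening tool.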

Why It Matters

Challenges the hype around LLMs for systematic reviews; researchers must validate with pilots and consider reproducibility, cost, and data constraints.