New study: LLM code summaries beat rules but lose reproducibility
Mixtral vs rule-based summarization on 150 Java design pattern files reveals a clear trade-off.
A new empirical study from Najam Nazar and Christoph Treude (arXiv:2605.21943) systematically compares deterministic and probabilistic approaches to automated code summarisation, specifically for design-pattern-centric Java code. Using 150 files from three open-source repositories covering nine design patterns, the researchers benchmarked three pipelines: a rule-based natural language generation (NLG) system, a Software Word Usage Model (SWUM)-based approach, and a probabilistic pipeline powered by the Mixtral LLM. Summaries were evaluated against human-written references using BERTScore and cosine similarity, plus a Llama 3-based rubric covering five dimensions: accuracy, conciseness, adequacy, code-context awareness, and design-pattern fidelity. Statistical tests included Wilcoxon signed-rank, Friedman with post-hoc corrections, and Spearman correlation.
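As a rough illustration of the cosine-similarity comparison between a generated summary and a human-written reference, here is a minimal sketch over bag-of-words token counts. How the study vectorises text is not stated here, so the tokeniser, the function names, and the count-based vectors are all assumptions made for illustration; an embedding-based variant would swap in different vectors but keep the same formula.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokeniser: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(summary, reference):
    # Represent each text as a token-count vector (an assumption,
    # not necessarily the representation used in the study).
    va = Counter(tokenize(summary))
    vb = Counter(tokenize(reference))
    # Dot product over the tokens the two texts share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty text has no direction, so report zero similarity.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up example strings, not data from the study.
    generated = "Creates a single shared instance of the logger"
    reference = "Provides one shared logger instance"
    print(round(cosine_similarity(generated, reference), 3))
```

The score ranges from 0 (no shared tokens) to 1 (identical token distributions), which is why it works as a cheap lexical complement to an embedding-based metric such as BERTScore.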
The results reveal a clear trade-off: probabilistic (LLM-based) summaries show stronger semantic alignment and richer contextual coverage, but they are sensitive to prompt wording and their output varies across runs. Deterministic methods, while less semantically rich, produce more concise and fully reproducible outputs. The authors note that the relative trends remain stable despite LLM variability. They argue this is the first controlled comparison focused on intent-oriented design-pattern code summarisation, and they offer practical guidance: use probabilistic methods when semantic depth and context matter most (e.g., for understanding complex pattern interactions), and deterministic approaches when brevity and reproducibility are critical (e.g., in automated documentation pipelines or regulated environments).
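One way a team could check the reproducibility side of this trade-off on its own pipeline is to summarise the same file several times and compare the outputs. The sketch below is an assumption-laden illustration, not the authors' protocol: the function names are hypothetical, and character-level similarity from Python's `difflib` stands in for whatever variability measure the study uses.

```python
from difflib import SequenceMatcher
from itertools import combinations


def is_reproducible(outputs):
    # True when every run produced exactly the same summary text.
    return len(set(outputs)) <= 1


def mean_pairwise_similarity(outputs):
    # Average difflib ratio over all pairs of runs; 1.0 means identical.
    pairs = list(combinations(outputs, 2))
    if not pairs:
        # Zero or one run: nothing to disagree with.
        return 1.0
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Made-up outputs for illustration only, not results from the study.
    rule_based_runs = ["Builds a Car object step by step."] * 3
    llm_runs = [
        "Builds a Car object step by step.",
        "Constructs a Car incrementally using the Builder pattern.",
        "Implements a builder that assembles Car instances.",
    ]
    print(is_reproducible(rule_based_runs), mean_pairwise_similarity(rule_based_runs))
    print(is_reproducible(llm_runs), mean_pairwise_similarity(llm_runs))
```

A deterministic pipeline should pass the `is_reproducible` check on every input, whereas for an LLM pipeline the mean pairwise similarity gives a rough sense of how far repeated runs drift apart.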
- Tested 150 Java design-pattern files from 3 repos (9 patterns) with 3 pipelines: rule-based NLG, SWUM, and Mixtral LLM
- LLM summaries showed stronger semantic alignment and richer contextual coverage, but were prompt-sensitive and varied across runs
- Deterministic methods (rule-based NLG and SWUM) were more concise and fully reproducible, making them suitable for automated documentation
Why It Matters
Billed by its authors as the first controlled comparison of its kind, the study helps teams pick a code summarisation approach: LLMs for depth, rules for reproducibility.