Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

18 causal mechanisms link training data flaws to code generation defects in LLMs

Deep Dive

Large language models (LLMs) frequently produce defective code, from logical bugs to security vulnerabilities. While these failures have often been treated as model-level limitations, a new systematic review by Kaifeng He and colleagues from multiple institutions traces their root causes to imperfections in training data. The paper, published on arXiv in May 2026, analyzes 114 primary studies and establishes a unified taxonomy that organizes generated-code quality issues into nine dimensions (e.g., correctness, security, efficiency) and training-data quality issues into code-related and non-code attributes (e.g., duplicate code, outdated APIs, noisy examples). The authors formalize a causal framework detailing 18 typical propagation mechanisms: concrete pathways through which specific data flaws manifest as specific code defects.
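
The paper presents its taxonomy in prose, but the causal framework lends itself to a simple data representation. The Python sketch below encodes a few flaw-to-defect pathways; the class, field names, and the three example entries are hypothetical illustrations (the review catalogs 18 mechanisms in full), not code or data from the paper.

```python
from dataclasses import dataclass

# Illustrative subset only -- the review describes 18 mechanisms in total.
# All names and example entries here are hypothetical, not from the paper.

@dataclass(frozen=True)
class PropagationMechanism:
    data_flaw: str         # training-data quality issue (code-related or non-code)
    defect_dimension: str  # one of the taxonomy's nine code-quality dimensions
    pathway: str           # how the flaw surfaces as a generation defect

MECHANISMS = [
    PropagationMechanism(
        data_flaw="duplicate code",
        defect_dimension="correctness",
        pathway="over-represented snippets bias generation toward memorized, "
                "context-inappropriate solutions",
    ),
    PropagationMechanism(
        data_flaw="outdated APIs",
        defect_dimension="maintainability",
        pathway="deprecated calls in the corpus resurface in generated code",
    ),
    PropagationMechanism(
        data_flaw="noisy examples",
        defect_dimension="security",
        pathway="insecure patterns in low-quality samples are reproduced verbatim",
    ),
]

def flaws_affecting(dimension: str) -> list[str]:
    """List the data flaws linked to a given code-quality dimension."""
    return [m.data_flaw for m in MECHANISMS if m.defect_dimension == dimension]

print(flaws_affecting("security"))  # -> ['noisy examples']
```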

The reviewed literature also reveals a decisive methodological shift: quality assurance is moving away from heuristic, post-generation filtering toward proactive, data-centric governance and closed-loop repair. Rather than patching outputs after generation, the community is investing in curating cleaner training corpora, detecting data issues before training, and continuously improving data pipelines based on observed model behavior. The authors identify open challenges such as automated data-defect detection at scale, dynamic evaluation of data quality over time, and building closed-loop systems in which generation failures feed back into data improvements. For practitioners, the study provides a comprehensive roadmap for building more reliable code LLMs through integrated data curation and continuous evaluation, a move that could significantly reduce the debugging burden on developers.
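
To make the proactive-governance idea concrete, here is a minimal sketch of pre-training data screening, assuming a corpus of code strings. The heuristics shown (hash-based exact deduplication and a small deny-list of deprecated APIs) are illustrative stand-ins chosen for brevity, not methods prescribed by the review; a production pipeline would add near-duplicate detection and richer static analysis.

```python
import hashlib
import re

# Hypothetical deny-list of outdated APIs; real pipelines would maintain
# a much larger, language-aware catalog.
DEPRECATED_API_PATTERNS = [
    re.compile(r"\burllib2\b"),     # removed in Python 3
    re.compile(r"\bos\.popen2\b"),  # removed in Python 3
]

def screen_corpus(samples: list[str]) -> tuple[list[str], list[str]]:
    """Split samples into (kept, flagged) before they ever reach training."""
    seen_hashes: set[str] = set()
    kept, flagged = [], []
    for code in samples:
        digest = hashlib.sha256(code.encode()).hexdigest()
        if digest in seen_hashes:  # duplicate code
            flagged.append(code)
            continue
        seen_hashes.add(digest)
        if any(p.search(code) for p in DEPRECATED_API_PATTERNS):  # outdated API
            flagged.append(code)
            continue
        kept.append(code)
    return kept, flagged

corpus = ["import urllib2", "def add(a, b):\n    return a + b", "import urllib2"]
kept, flagged = screen_corpus(corpus)
print(len(kept), len(flagged))  # -> 1 2
```

In a closed-loop setup of the kind the authors envision, defects observed in generated code would contribute new patterns to the flagged set, feeding back into the next curation pass.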

Key Points
  • Systematic review of 114 studies on LLM code generation quality issues
  • Taxonomy of nine code-defect dimensions and 18 mechanisms by which training-data flaws propagate into code defects
  • Shift from reactive post-generation filtering to proactive data-centric governance and closed-loop repair

Why It Matters

For teams using AI coding assistants, fixing training data quality is more impactful than patching faulty outputs.