Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
Treats training data like code to diagnose and fix model failures.
A team of researchers led by Chenkai Pan and Cheng Tan has published a paper introducing "Programming with Data," a novel framework that reimagines the relationship between training data and large language model (LLM) behavior. The core insight is to treat a structured knowledge representation extracted from the source corpus as the shared foundation for both training data and evaluation. This allows the entire data-engineering lifecycle to be mapped onto the software-development lifecycle: training data becomes source code, model training becomes compilation, benchmarking becomes unit testing, and failure-driven data repair becomes debugging.
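The mapping is easiest to see as a loop. Below is a minimal, self-contained Python sketch of that cycle; the data structures (a concept-to-facts dictionary standing in for the corpus, a set of learned concepts standing in for the model) and every function in it are illustrative assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch only: a dict of concept -> supporting facts stands in
# for the training corpus, and a set of learned concepts stands in for the
# trained model. None of this is the paper's actual implementation.

def train(corpus):
    """'Compilation': the toy model learns every concept that has at
    least one supporting fact in the training data."""
    return {concept for concept, facts in corpus.items() if facts}

def run_benchmark(model, tests):
    """'Unit testing': a test fails when its concept was never learned."""
    return [concept for concept in tests if concept not in model]

def patch(corpus, failures, knowledge_base):
    """'Debugging': trace each failure back to missing data and pull the
    supporting facts from the shared structured knowledge base."""
    for concept in failures:
        corpus.setdefault(concept, []).extend(knowledge_base.get(concept, []))
    return corpus

# The shared knowledge base grounds both the training data and the tests.
knowledge_base = {
    "osmosis": ["water moves across a semipermeable membrane"],
    "entropy": ["entropy measures the disorder of a system"],
}
corpus = {"osmosis": list(knowledge_base["osmosis"])}  # initially incomplete
tests = ["osmosis", "entropy"]

for _ in range(3):  # compile -> test -> debug, repeated until tests pass
    failures = run_benchmark(train(corpus), tests)
    if not failures:
        break
    corpus = patch(corpus, failures, knowledge_base)

print(run_benchmark(train(corpus), tests))  # -> [] once the gap is patched
```

The point of the loop is that each failing test names the data it needs, so the patch is targeted rather than a wholesale re-collection of the corpus.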
The framework enables model failures to be decomposed into concept-level gaps and reasoning-chain breaks, which can be traced back to specific deficiencies in the data and repaired through targeted patches. The researchers validated their approach across sixteen disciplines, including natural sciences, engineering, biomedicine, and social sciences, and released a structured knowledge base, benchmark suite, and training corpus as open resources. The work demonstrates that the relationship between training data and model behavior is structurally traceable and systematically repairable, establishing a principled foundation for reliably engineering human expertise into language models.
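To make the two failure types concrete, here is a hedged sketch of that decomposition under toy assumptions: a reasoning chain is a list of concepts, a concept-level gap is a concept with no supporting facts, and a reasoning-chain break is an adjacent pair of concepts with no example connecting them. The `diagnose` helper and its substring test for "connects" are hypothetical stand-ins, not the paper's tracing procedure.

```python
# Toy diagnosis of a single failed test. The classification rule (substring
# containment as a proxy for "an example connects two concepts") is an
# illustrative assumption, not the paper's actual method.

def diagnose(chain, corpus):
    """Split a failure into concept-level gaps (concepts with no
    supporting data) and reasoning-chain breaks (adjacent concepts
    with no example linking them)."""
    gaps = [c for c in chain if not corpus.get(c)]
    breaks = [(a, b) for a, b in zip(chain, chain[1:])
              if not any(b in fact for fact in corpus.get(a, []))]
    return gaps, breaks

corpus = {
    "diffusion": ["diffusion drives osmosis across membranes"],
    "osmosis": [],  # concept is known to exist, but has no supporting data
}
chain = ["diffusion", "osmosis", "turgor pressure"]
print(diagnose(chain, corpus))
# -> (['osmosis', 'turgor pressure'], [('osmosis', 'turgor pressure')])
```

Under these assumptions the repair would differ by failure type: a gap calls for new facts about the concept itself, while a break calls for examples that link two concepts already present.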
- Maps data engineering onto software development: training data as code, model training as compilation, benchmarking as unit testing, data repair as debugging.
- Validated across 16 disciplines (natural sciences, engineering, biomedicine, social sciences) with consistent improvements across model scales and architectures.
- Open resources released: structured knowledge base, benchmark suite, and training corpus.
Why It Matters
Makes debugging LLM training data as systematic as debugging code, enabling reliable transfer of domain expertise into language models.