Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
Treats training data like code to diagnose and fix model failures.
A team of researchers led by Chenkai Pan and Cheng Tan has published a paper introducing "Programming with Data," a novel framework that reimagines the relationship between training data and large language model (LLM) behavior. The core insight is to treat a structured knowledge representation extracted from the source corpus as the shared foundation for both training data and evaluation. This allows the entire data-engineering lifecycle to be mapped onto the software-development lifecycle: training data becomes source code, model training becomes compilation, benchmarking becomes unit testing, and failure-driven data repair becomes debugging.
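The mapping is easiest to see as a loop. Below is a minimal, self-contained Python sketch of that cycle; the data structures (a concept-to-facts dictionary standing in for the corpus, a set of learned concepts standing in for the model) and every function in it are illustrative assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch only: a dict of concept -> supporting facts stands in
# for the training corpus, and a set of learned concepts stands in for the
# trained model. None of this is the paper's actual implementation.

def train(corpus):
    """'Compilation': the toy model learns every concept that has at
    least one supporting fact in the training data."""
    return {concept for concept, facts in corpus.items() if facts}

def run_benchmark(model, tests):
    """'Unit testing': a test fails when its concept was never learned."""
    return [concept for concept in tests if concept not in model]

def patch(corpus, failures, knowledge_base):
    """'Debugging': trace each failure back to missing data and pull the
    supporting facts from the shared structured knowledge base."""
    for concept in failures:
        corpus.setdefault(concept, []).extend(knowledge_base.get(concept, []))
    return corpus

# The shared knowledge base grounds both the training data and the tests.
knowledge_base = {
    "osmosis": ["water moves across a semipermeable membrane"],
    "entropy": ["entropy measures the disorder of a system"],
}
corpus = {"osmosis": list(knowledge_base["osmosis"])}  # initially incomplete
tests = ["osmosis", "entropy"]

for _ in range(3):  # compile -> test -> debug, repeated until tests pass
    failures = run_benchmark(train(corpus), tests)
    if not failures:
        break
    corpus = patch(corpus, failures, knowledge_base)

print(run_benchmark(train(corpus), tests))  # -> [] once the gap is patched
```

The point of the loop is that each failing test names the data it needs, so the patch is targeted rather than a wholesale re-collection of the corpus.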
The framework enables model failures to be decomposed into concept-level gaps and reasoning-chain breaks, which can be traced back to specific deficiencies in the data and repaired through targeted patches. The researchers validated their approach across sixteen disciplines, including natural sciences, engineering, biomedicine, and social sciences, and released a structured knowledge base, benchmark suite, and training corpus as open resources. The work demonstrates that the relationship between training data and model behavior is structurally traceable and systematically repairable, establishing a principled foundation for reliably engineering human expertise into language models.
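To make the two failure types concrete, here is a hedged sketch of that decomposition under toy assumptions: a reasoning chain is a list of concepts, a concept-level gap is a concept with no supporting facts, and a reasoning-chain break is an adjacent pair of concepts with no example connecting them. The `diagnose` helper and its substring test for "connects" are hypothetical stand-ins, not the paper's tracing procedure.

```python
# Toy diagnosis of a single failed test. The classification rule (substring
# containment as a proxy for "an example connects two concepts") is an
# illustrative assumption, not the paper's actual method.

def diagnose(chain, corpus):
    """Split a failure into concept-level gaps (concepts with no
    supporting data) and reasoning-chain breaks (adjacent concepts
    with no example linking them)."""
    gaps = [c for c in chain if not corpus.get(c)]
    breaks = [(a, b) for a, b in zip(chain, chain[1:])
              if not any(b in fact for fact in corpus.get(a, []))]
    return gaps, breaks

corpus = {
    "diffusion": ["diffusion drives osmosis across membranes"],
    "osmosis": [],  # concept is known to exist, but has no supporting data
}
chain = ["diffusion", "osmosis", "turgor pressure"]
print(diagnose(chain, corpus))
# -> (['osmosis', 'turgor pressure'], [('osmosis', 'turgor pressure')])
```

Under these assumptions the repair would differ by failure type: a gap calls for new facts about the concept itself, while a break calls for examples that link two concepts already present.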
- Maps data engineering onto software development: training data as code, model training as compilation, benchmarking as unit testing, data repair as debugging.
- Validated across 16 disciplines (natural sciences, engineering, biomedicine, social sciences) with consistent improvements across model scales and architectures.
- Open resources released: structured knowledge base, benchmark suite, and training corpus.
Why It Matters
Makes debugging LLM training data as systematic as debugging code, enabling reliable transfer of domain expertise into language models.