Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG
A new training method combines synthetic QAs and documents to break the RAG ceiling on long-document comprehension.
A team of researchers from institutions including the University of Washington and Stanford has published a paper introducing 'Synthetic Mixed Training' (SMT), a method designed to overcome the limitations of Retrieval-Augmented Generation (RAG) for teaching language models new knowledge. The core innovation is a dual-stream approach that trains models on a mixture of synthetic question-answer pairs and synthetic documents, which provide complementary learning signals. This method breaks the 'RAG ceiling,' where simply scaling up synthetic data volume or generator strength previously yielded diminishing returns. On the QuALITY long-document reading comprehension benchmark, SMT achieved a 2.6% relative performance gain over RAG.
The researchers also developed 'Focal Rewriting,' a technique for generating synthetic documents that are explicitly conditioned on specific questions, thereby increasing diversity and improving the scaling curve. Combined, these techniques allowed a relatively small Llama 8B model to outperform RAG by 4.4% on QuALITY. The method proved robust across multiple benchmarks, including LongHealth and FinanceBench, outperforming RAG in five of six tested settings. Notably, when SMT was used in conjunction with RAG, it delivered a combined performance gain of 9.1%, demonstrating its potential as a complementary enhancement to existing retrieval systems.
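The dual-stream idea, training on synthetic QA pairs alongside synthetic documents, can be pictured as a simple corpus-construction step. The sketch below is illustrative only: the function names, formatting templates, and shuffle-based mixing are assumptions for exposition, not the paper's actual recipe.

```python
import random

# Illustrative sketch of Synthetic Mixed Training (SMT) data construction.
# Templates and mixing strategy are assumptions, not the published recipe.

def make_qa_example(question: str, answer: str) -> dict:
    """Format a synthetic QA pair as a supervised training example."""
    return {"text": f"Question: {question}\nAnswer: {answer}"}

def make_doc_example(document: str) -> dict:
    """Format a synthetic document for plain language-model training."""
    return {"text": document}

def build_smt_corpus(qa_pairs, documents, seed=0):
    """Interleave the two synthetic streams into one shuffled corpus.

    The paper's actual mixing ratio is not given in this summary; here the
    two streams are simply concatenated and shuffled.
    """
    corpus = [make_qa_example(q, a) for q, a in qa_pairs]
    corpus += [make_doc_example(d) for d in documents]
    random.Random(seed).shuffle(corpus)
    return corpus

# Toy usage with placeholder synthetic data.
qa = [("Who chaired the committee?", "Dr. Ortega.")]
docs = ["The committee, chaired by Dr. Ortega, convened to review the audit."]
corpus = build_smt_corpus(qa, docs)
print(len(corpus))  # 2
```

The point of the mixture is that QA pairs teach answer extraction while documents teach the underlying facts in context; a real pipeline would generate both streams from the same source corpus at scale.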
- Synthetic Mixed Training combines synthetic QAs and documents for log-linear scaling, breaking the RAG performance ceiling.
- The 'Focal Rewriting' technique conditions document generation on specific questions, improving data diversity and scaling efficiency.
- A Llama 8B model trained with this method beat RAG by 4.4% on QuALITY and achieved a 9.1% gain when combined with RAG.
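Focal Rewriting, as described above, conditions synthetic document generation on a specific question. One plausible realization is a prompt that asks a generator model to rewrite a source passage around the facts the question requires. The template wording below is entirely an assumption; the paper's actual prompt is not given in this summary.

```python
# Hypothetical Focal Rewriting prompt builder. The template text is an
# illustrative assumption, not the authors' published prompt.

FOCAL_REWRITE_TEMPLATE = (
    "Rewrite the following document so that it clearly states the facts "
    "needed to answer the question, while staying faithful to the source.\n\n"
    "Question: {question}\n\n"
    "Source document:\n{document}\n\n"
    "Rewritten document:"
)

def focal_rewrite_prompt(document: str, question: str) -> str:
    """Build a generation prompt conditioned on one focal question."""
    return FOCAL_REWRITE_TEMPLATE.format(question=question, document=document)

prompt = focal_rewrite_prompt(
    document="The committee convened in March to review the annual audit.",
    question="When did the committee convene?",
)
print(prompt)
```

Conditioning each rewrite on a different question is what drives the diversity gain: the same source passage yields many distinct synthetic documents, one per question.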
Why It Matters
This enables smaller, more efficient models to internalize complex knowledge, reducing reliance on external retrieval systems for accurate long-form reasoning.