Research & Papers

Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining

A new study shows that rewriting high-quality data with a 7B model into roughly 40B tokens of synthetic text significantly improves Portuguese language model performance.

Deep Dive

A team of researchers including Thales Sales Almeida has published a groundbreaking study on arXiv titled 'Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining.' The research provides the first controlled analysis of how synthetic data generation through document rewriting interacts with the quality of the source material, specifically for Portuguese language models. Starting with the ClassiCC-PT corpus annotated with STEM and Educational quality scores, the team constructed two 10B-token subsets at different quality levels. They then used a 7B instruction-tuned model to rewrite each subset into four distinct styles, generating approximately 40B tokens of synthetic data per condition.
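To make the setup concrete, below is a minimal Python sketch of the pipeline as described: filter a quality-annotated corpus down to a fixed token budget, then rewrite each selected document into four styles with an instruction-tuned model. The model name, style prompts, quality threshold, and document field names are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of the quality-filter-then-rewrite pipeline.
# Model name, prompts, threshold, and field names are assumptions.
from transformers import pipeline

STYLES = {
    "wikipedia": "Rewrite the following text in the style of an encyclopedia article:",
    "textbook": "Rewrite the following text as a textbook explanation:",
    "qa": "Rewrite the following text as a question-and-answer dialogue:",
    "summary": "Rewrite the following text as a concise summary:",
}

# Placeholder 7B instruction-tuned model; not the one used in the paper.
generator = pipeline("text-generation", model="some-7b-instruct-model")

def build_subset(corpus, score_key, threshold, token_budget):
    """Select documents whose quality score (e.g. STEM or Educational)
    clears the threshold, stopping at the token budget (e.g. 10B)."""
    subset, tokens = [], 0
    for doc in corpus:
        if doc[score_key] >= threshold:
            subset.append(doc)
            tokens += doc["num_tokens"]
            if tokens >= token_budget:
                break
    return subset

def rewrite(subset):
    """Rewrite every document into each of the four styles, yielding
    roughly 4x the source tokens (~40B from a 10B-token subset)."""
    for doc in subset:
        for style, prompt in STYLES.items():
            out = generator(
                f"{prompt}\n\n{doc['text']}",
                max_new_tokens=1024,
                return_full_text=False,  # keep only the rewritten text
            )
            yield {"style": style, "text": out[0]["generated_text"]}
```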

The researchers trained two English-centric base models (1.1B and 7B parameters) on each data condition and evaluated them on PoETa V2, a comprehensive 44-task Portuguese benchmark. The results revealed a striking scale-dependent effect: at the 7B scale, rewriting high-quality source data yielded a significant gain of +3.4 points on the Normalized Performance Metric (NPM) over training on the same data unmodified. In contrast, rewriting low-quality data provided only a marginal +0.5 NPM improvement. Interestingly, at the smaller 1.1B scale this quality interaction was weaker, with unmodified low-quality data performing comparably to rewritten high-quality data.
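For readers unfamiliar with NPM-style scoring, the sketch below shows a common way such a metric is aggregated across tasks: each task's raw score is rescaled so that a random baseline maps to 0 and a perfect score to 100, then the rescaled values are averaged. The exact PoETa V2 definition is not given here, so treat this as an assumed, illustrative formula.

```python
# Assumed NPM aggregation; the exact PoETa V2 formula may differ.

def npm(score: float, random_baseline: float, max_score: float = 1.0) -> float:
    """Rescale a task score so random performance = 0 and perfect = 100."""
    return 100.0 * (score - random_baseline) / (max_score - random_baseline)

def aggregate(task_results):
    """task_results: list of (raw_score, random_baseline) pairs,
    one per benchmark task (44 tasks for PoETa V2)."""
    values = [npm(score, baseline) for score, baseline in task_results]
    return sum(values) / len(values)

# Under this reading, a +3.4 NPM gain means the rewritten-high-quality
# run's average rescaled score sits 3.4 points above the unmodified run.
```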

This study fundamentally shifts how we think about synthetic data generation for AI training. The findings demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for careful data curation. The technique amplifies the value of existing high-quality datasets but cannot compensate for poor source material, especially in larger models. This has major implications for AI development pipelines, suggesting that teams should prioritize sourcing quality data before investing in synthetic augmentation techniques.

Key Points
  • Rewriting high-quality Portuguese data with a 7B model yielded a +3.4 NPM gain on the 44-task PoETa V2 benchmark
  • The study generated 40B tokens of synthetic data per condition by rewriting 10B-token subsets into four distinct styles
  • The effect is scale-dependent: at 1.1B parameters, low-quality unmodified data performed comparably to rewritten high-quality data

Why It Matters

This research provides a data-driven framework for prioritizing quality over quantity in synthetic data generation for non-English AI models.