OP-Mix: New algorithm unifies data mixing across all LM training phases
Matches on-policy distillation with 95% less compute in continual learning and improves pretraining perplexity by 6.3%
Data mixing, the problem of deciding how to combine different sources or types of data, is critical to language model training, but existing solutions are fragmented. Current methods rely on smaller proxy models tied to a single training phase, assume a fixed domain set, or offer no principled guidance for continual learning. In a new paper on arXiv, researchers from NYU and Google propose OP-Mix (On-Policy Mix), a unified algorithm that treats data mixing as an online decision-making problem recurring throughout the entire LM lifecycle.
The key insight is that candidate data mixtures can be cheaply simulated by interpolating between low-rank adapters (LoRAs) trained directly on the current model. This eliminates the need for separate proxy models and ensures the search is always grounded in the model's actual learning dynamics. Across pretraining, continual midtraining, and continual instruction tuning, OP-Mix consistently finds near-optimal mixtures using a fraction of the compute of baselines. Specifically, in pretraining it improves average perplexity by 6.3% over training without mixing. For continual learning, OP-Mix matches the performance of both retraining and on-policy distillation while using 66% and 95% less overall compute, respectively.
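To make the adapter-interpolation idea concrete, here is a minimal toy sketch in Python with NumPy. It is not the paper's implementation: the summary above does not say how OP-Mix interpolates adapters or scores candidates, so this sketch assumes that a mixture is simulated by taking a convex combination of per-domain LoRA weight updates and that candidates are ranked by a held-out loss. The names `interpolate_adapters`, `pick_mixture`, and `eval_loss` are hypothetical, and a single 4x4 matrix stands in for the model.

```python
import numpy as np


def interpolate_adapters(base_W, adapters, weights):
    """Simulate a data mixture by blending per-domain LoRA updates.

    Assumption (not from the paper): each domain i has an adapter (A_i, B_i)
    with low-rank update B_i @ A_i, and a mixture with weights w is
    approximated by base_W + sum_i w_i * (B_i @ A_i).
    """
    weights = np.asarray(weights, dtype=float)
    if weights.min() < 0 or not np.isclose(weights.sum(), 1.0):
        raise ValueError("mixture weights must be non-negative and sum to 1")
    delta = sum(w * (B @ A) for w, (A, B) in zip(weights, adapters))
    return base_W + delta


def pick_mixture(base_W, adapters, candidates, eval_loss):
    """Score every candidate mixture and return (best_loss, best_weights)."""
    scored = [
        (float(eval_loss(interpolate_adapters(base_W, adapters, w))), tuple(w))
        for w in candidates
    ]
    return min(scored)


# Toy setup: a 4x4 "model" and two rank-1 adapters, one per data domain.
base_W = np.zeros((4, 4))
A1, B1 = np.array([[1.0, 0.0, 0.0, 0.0]]), np.array([[1.0], [0.0], [0.0], [0.0]])
A2, B2 = np.array([[0.0, 1.0, 0.0, 0.0]]), np.array([[0.0], [1.0], [0.0], [0.0]])
adapters = [(A1, B1), (A2, B2)]

# Hypothetical held-out objective: squared distance to a target that
# happens to correspond to a 25% / 75% mixture of the two domains.
target = base_W + 0.25 * (B1 @ A1) + 0.75 * (B2 @ A2)


def eval_loss(W):
    return np.sum((W - target) ** 2)


# Coarse grid over the two-domain simplex.
candidates = [(t, 1.0 - t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
best_loss, best_weights = pick_mixture(base_W, adapters, candidates, eval_loss)
print(best_weights, best_loss)
```

Under these assumptions, scoring a candidate costs one weighted sum of adapter updates plus one evaluation, with no additional training run per mixture, which is the intuition behind simulating mixtures cheaply on the current model.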
- OP-Mix uses low-rank adapters (LoRAs) to simulate data mixtures on the current model, removing the need for proxy models
- Improves pretraining perplexity by 6.3% over no data mixing
- Matches retraining and on-policy distillation performance with 66% and 95% less compute in continual learning settings
Why It Matters
A single, compute-efficient data mixing method could streamline LM training from pretraining through fine-tuning.