OP-Mix: New algorithm unifies data mixing across all LM training phases

arXiv cs.CL May 18, 2026

⚡ Improves pretraining perplexity by 6.3% and matches on-policy distillation in continual learning with 95% less compute

Deep Dive

Data mixing, the choice of how to combine different sources or types of data, is a critical but fragmented problem in language model training. Existing methods rely on smaller proxy models tied to a single training phase, assume a fixed domain set, or offer no principled guidance for continual learning. In a new paper on arXiv, researchers from NYU and Google propose OP-Mix (On-Policy Mix), a unified algorithm that treats data mixing as an online decision-making problem recurring throughout the entire LM lifecycle.

The key insight is that candidate data mixtures can be cheaply simulated by interpolating between low-rank adapters (LoRAs) trained directly on the current model. This eliminates the need for separate proxy models and ensures the search is always grounded in the model's actual learning dynamics. Across pretraining, continual midtraining, and continual instruction tuning, OP-Mix consistently finds near-optimal mixtures using a fraction of the compute of baselines. Specifically, in pretraining it improves average perplexity by 6.3% over training without mixing. For continual learning, OP-Mix matches the performance of both retraining and on-policy distillation while using 66% and 95% less overall compute, respectively.
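To make the adapter-interpolation idea concrete, here is a minimal NumPy sketch. It is not the paper's implementation. It assumes that "interpolating" means a weighted sum of per-domain low-rank updates added to a single base weight matrix, and that candidate mixtures are scored by a plain grid search over the probability simplex; neither detail is confirmed by the summary above. Random matrices stand in for trained LoRAs, and all names (`lora_delta`, `interpolate`, `simplex_grid`, `best_mixture`) are hypothetical.

```python
import numpy as np


def lora_delta(rng, d_out, d_in, rank, scale=0.1):
    """Random low-rank update B @ A, a stand-in for one trained per-domain LoRA."""
    B = rng.normal(size=(d_out, rank)) * scale
    A = rng.normal(size=(rank, d_in)) * scale
    return B @ A


def interpolate(base_w, deltas, weights):
    """Add a weighted sum of LoRA updates to the base weight (assumed form of mixing)."""
    weights = np.asarray(weights, dtype=float)
    if weights.min() < 0 or not np.isclose(weights.sum(), 1.0):
        raise ValueError("mixture weights must be non-negative and sum to 1")
    return base_w + sum(w * d for w, d in zip(weights, deltas))


def simplex_grid(n_domains, steps):
    """All mixtures whose weights are multiples of 1/steps and sum to 1."""
    def compositions(n, total):
        if n == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(n - 1, total - first):
                yield (first,) + rest

    return [tuple(c / steps for c in comp) for comp in compositions(n_domains, steps)]


def eval_loss(w, x, y):
    """Mean squared error of the linear map x -> x @ w.T on held-out data."""
    return float(np.mean((x @ w.T - y) ** 2))


def best_mixture(base_w, deltas, x_val, y_val, steps=4):
    """Score every grid mixture on the merged weights; no retraining is involved."""
    candidates = simplex_grid(len(deltas), steps)
    losses = [eval_loss(interpolate(base_w, deltas, c), x_val, y_val) for c in candidates]
    i = int(np.argmin(losses))
    return candidates[i], losses[i]


# Toy setup: three "domains", with validation targets generated by a known mixture.
rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 8, 2
base_w = rng.normal(size=(d_out, d_in))
deltas = [lora_delta(rng, d_out, d_in, rank) for _ in range(3)]
x_val = rng.normal(size=(64, d_in))
y_val = x_val @ interpolate(base_w, deltas, (0.5, 0.5, 0.0)).T

mix, loss = best_mixture(base_w, deltas, x_val, y_val, steps=4)
print("selected mixture:", mix, "validation loss:", loss)
```

Each candidate costs one weight merge and one validation pass, which is the intuition behind simulating mixtures on the current model instead of training a separate proxy model per candidate. A real system would merge adapters across every adapted layer of a language model and score candidates with a language-modeling loss, not the toy regression used here.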

Key Points

OP-Mix uses low-rank adapters (LoRAs) to simulate data mixtures on the current model, removing the need for proxy models
Improves pretraining perplexity by 6.3% over no data mixing
Matches retraining and on-policy distillation performance with 66% and 95% less compute in continual learning settings

Why It Matters

A single, compute-efficient data mixing method could streamline LM training from pretraining through fine-tuning.
