Research & Papers

OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

New method decouples data ratio selection from training, letting you optimize models after the fact.

Deep Dive

Researchers Haiyue Song and Masao Utiyama have introduced OptiMer, a paradigm-shifting technique for continual pre-training (CPT) of large language models. Traditional CPT requires fixing the mixture ratios of different datasets (e.g., Japanese text, math problems, code) before a costly, weeks-long training run begins, so a suboptimal ratio wastes immense compute. OptiMer decouples ratio selection from the training run itself: it first trains one small CPT model on each individual dataset, then extracts from each model a 'distribution vector', a mathematical representation of the parameter shift induced by that specific data.
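In spirit, the distribution vector resembles a task vector: the element-wise difference between the adapted parameters and the base parameters. A minimal sketch of the extraction step, assuming PyTorch state dicts and a plain task-vector-style delta (the paper's exact construction may differ):

    def extract_distribution_vector(base_model, cpt_model):
        # Per-tensor parameter shift induced by CPT on one dataset.
        # Assumption: the vector is a simple delta over all parameters.
        base = base_model.state_dict()
        adapted = cpt_model.state_dict()
        return {name: adapted[name] - base[name] for name in base}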

The core innovation is post-hoc optimization. After extracting these vectors, OptiMer uses Bayesian optimization to search for the optimal weights to merge them, targeting a specific objective like Japanese translation or code generation. This search is dramatically cheaper than tuning data mixtures through full retraining. In experiments adapting Google's Gemma 3 27B model to languages (Japanese, Chinese) and domains (Math, Code), OptiMer consistently matched or outperformed conventional data-mixing CPT. Crucially, it achieved this with a 15- to 35-fold reduction in the computational cost of finding the optimal setup.
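The search is cheap because each candidate is just a weighted sum of already-extracted vectors applied to the base weights, followed by an evaluation; no gradient steps are taken. A hedged sketch using Optuna's default TPE sampler as a Bayesian-style optimizer (the merge form theta_base + sum_i w_i * v_i, the [0, 1] weight range, and evaluate_fn are illustrative assumptions, not the paper's exact setup):

    import optuna

    def find_merge_weights(base_state, vectors, evaluate_fn, n_trials=50):
        # Search weights w_i so that theta_base + sum_i w_i * v_i
        # maximizes the target score, with no further training.
        def objective(trial):
            weights = [trial.suggest_float(f"w{i}", 0.0, 1.0)
                       for i in range(len(vectors))]
            merged = {k: t.clone() for k, t in base_state.items()}
            for w, vec in zip(weights, vectors):
                for name in merged:
                    merged[name] += w * vec[name]
            return evaluate_fn(merged)  # e.g., a Japanese-translation benchmark score

        # Optuna's default TPE sampler is a sequential model-based
        # (Bayesian-style) optimizer; the paper's BO method may differ.
        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=n_trials)
        return study.best_params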

Key findings reveal OptiMer's flexibility. First, the optimized merging weights can be interpreted as ideal data mixture ratios, providing a recipe to improve traditional CPT if desired. Second, and more powerfully, the same pool of distribution vectors can be re-optimized for different end goals without any additional training. This allows teams to produce a variety of target-tailored models (e.g., one for legal Japanese, another for scientific code) on-demand from a single set of initial training runs. The work fundamentally reframes data mixture selection from a rigid, pre-commitment decision into a flexible, post-training optimization problem.
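Concretely, re-targeting reduces to re-running the search above with a different scoring function over the same vector pool (the evaluator names here are hypothetical):

    # Same distribution vectors, different end goals, zero new training runs.
    ja_weights = find_merge_weights(base_state, vectors, evaluate_fn=legal_japanese_score)
    code_weights = find_merge_weights(base_state, vectors, evaluate_fn=scientific_code_score)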

Key Points
  • Decouples data ratio tuning from training, using post-hoc Bayesian optimization on extracted 'distribution vectors'.
  • Achieved equal or better performance vs. data mixing on Gemma 3 27B with 15-35x lower computational search cost.
  • Enables creation of multiple target-tailored models from a single vector pool without retraining.

Why It Matters

Dramatically reduces the cost and time of adapting foundation models to new languages and specialized domains.