MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
A new method finds optimal data mixtures, cutting training steps by up to 2x and boosting Qwen2-7B performance by up to 17.6%.
A research team from the University of Washington and the Allen Institute for AI has introduced MixAtlas, a novel method for optimizing data mixtures during the midtraining phase of multimodal large language models (MLLMs). Unlike current approaches that tune data mixtures along a single dimension, such as format or task, MixAtlas decomposes training corpora along two axes: image concepts (10 visual-domain clusters discovered via CLIP embeddings) and task supervision (5 objective types: captioning, OCR, grounding, detection, and VQA). This creates a more granular search space for finding optimal training recipes.
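To make the two-axis decomposition concrete, the sketch below buckets a corpus into concept-by-task cells. The specific choices here (k-means over precomputed CLIP image embeddings, task labels already attached to each sample, and the helper names) are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of a two-axis corpus decomposition (assumed details, not the paper's pipeline).
import numpy as np
from sklearn.cluster import KMeans

TASK_TYPES = ["captioning", "ocr", "grounding", "detection", "vqa"]

def decompose_corpus(image_embeddings: np.ndarray,
                     task_labels: list[str],
                     n_concepts: int = 10):
    """Assign each sample to a (concept cluster, task type) cell.

    image_embeddings: (N, D) CLIP image embeddings.
    task_labels:      length-N list of strings drawn from TASK_TYPES.
    Returns an (n_concepts x 5) count grid and per-sample cell indices.
    """
    # Concept axis: cluster image embeddings into visual-domain groups.
    concept_ids = KMeans(n_clusters=n_concepts, n_init=10,
                         random_state=0).fit_predict(image_embeddings)
    # Task axis: map each sample's supervision type to an index.
    task_ids = np.array([TASK_TYPES.index(t) for t in task_labels])

    grid = np.zeros((n_concepts, len(TASK_TYPES)), dtype=int)
    for c, t in zip(concept_ids, task_ids):
        grid[c, t] += 1  # count samples per (concept, task) cell
    return grid, np.stack([concept_ids, task_ids], axis=1)
```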
Using small proxy models such as Qwen2-0.5B paired with a Gaussian-process surrogate and a GP-UCB acquisition function, MixAtlas searches this mixture space efficiently, finding significantly better-performing combinations within the same computational budget as regression-based baselines. The method was evaluated on 10 benchmarks spanning visual understanding, document reasoning, and multimodal reasoning, demonstrating substantial improvements when recipes were transferred to larger models.
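The search itself can be illustrated with a short GP-UCB loop over mixture weights. Everything below is a hedged sketch: the simplex sampling, the Matern kernel, and the synthetic stand-in for "train a 0.5B proxy and score it" are assumptions for illustration, not MixAtlas's actual implementation.

```python
# Sketch of a GP-UCB search over mixture weights on the simplex of (concept x task) cells.
# Kernel, candidate sampling, and the stub objective are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

def sample_simplex(n_cells: int, n_points: int) -> np.ndarray:
    """Draw candidate mixture weights uniformly from the probability simplex."""
    return rng.dirichlet(np.ones(n_cells), size=n_points)

def train_proxy_and_score(weights: np.ndarray) -> float:
    """Placeholder for training a small proxy (e.g., Qwen2-0.5B) on this mixture
    and averaging its benchmark scores; a synthetic objective stands in here so
    the loop runs end to end."""
    target = np.full_like(weights, 1.0 / len(weights))
    return float(-np.sum((weights - target) ** 2))

def gp_ucb_search(n_cells: int = 50, n_init: int = 8,
                  n_rounds: int = 20, beta: float = 2.0):
    # Warm-start the surrogate with a few randomly sampled mixtures.
    X = sample_simplex(n_cells, n_init)
    y = np.array([train_proxy_and_score(w) for w in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

    for _ in range(n_rounds):
        gp.fit(X, y)
        candidates = sample_simplex(n_cells, 1024)
        mu, sigma = gp.predict(candidates, return_std=True)
        ucb = mu + beta * sigma                # GP-UCB: predicted mean + exploration bonus
        best = candidates[np.argmax(ucb)]      # most promising untried mixture
        X = np.vstack([X, best])
        y = np.append(y, train_proxy_and_score(best))

    return X[np.argmax(y)], float(y.max())     # best mixture found and its score
```

The exploration bonus is what makes such a search uncertainty-aware: mixtures the surrogate is unsure about get tried even if their predicted score is middling, so the evaluation budget is not spent re-testing near-duplicates of known-good recipes.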
On Qwen2-7B, optimized mixtures improved average performance by 8.5% to 17.6% over the strongest baseline, while on Qwen2.5-7B, gains were 1.0% to 3.3%. Both settings reached baseline-equivalent training loss in up to 2x fewer steps, indicating improved sample efficiency. Crucially, recipes discovered using 0.5B proxy models successfully transferred to 7B-scale training across Qwen model families, validating the scalability of the approach.
The research addresses a critical gap in multimodal training where data mixture optimization has remained largely unexplored despite its potential for improving sample efficiency and downstream generalization. MixAtlas produces benchmark-targeted data recipes that can be inspected, adapted, and transferred to new corpora, offering a systematic framework for data curation that could significantly reduce training costs and improve model performance across multimodal tasks.
Key Results
- Optimized data mixtures improved Qwen2-7B performance by 8.5%-17.6% across 10 multimodal benchmarks
- Reached baseline-equivalent training loss in up to 2x fewer steps, an up to twofold gain in sample efficiency
- Recipes discovered on 0.5B proxy models successfully transferred to 7B-scale Qwen models
Why It Matters
Dramatically reduces training costs for multimodal AI while improving performance, making advanced vision-language models more accessible.