PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging
New technique combines vision-language models without retraining from scratch.
Researchers from the Institute of Automation, Chinese Academy of Sciences (CASIA) and Microsoft have introduced PivotMerge, a novel framework for merging multiple multimodal large language models (MLLMs) trained on different datasets into a single, more capable model. Published on arXiv, the work addresses a key limitation of existing model-merging techniques, which focus almost exclusively on the post-finetuning stage. PivotMerge instead targets the pre-training phase, where models learn cross-modal alignment—connecting visual and textual representations—from heterogeneous data sources. This alignment step is critical for building robust vision-language models like CLIP or GPT-4V.
PivotMerge tackles two core challenges: cross-domain parameter interference, where conflicting updates from different datasets degrade performance, and layer-wise alignment contribution disparity, where different neural network layers contribute unevenly to alignment. The framework introduces two key components: Shared-space Decomposition and Filtering, which isolates common alignment patterns while suppressing conflicting directions, and Alignment-guided Layer-wise Merging, which assigns custom weights to each layer based on its alignment importance. In extensive evaluations on CC12M-based multimodal benchmarks, PivotMerge consistently outperformed existing baselines, demonstrating both effectiveness and generalization across tasks like image captioning and visual question answering.
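The paper does not publish reference code here, but the Shared-space Decomposition and Filtering idea can be sketched roughly: treat each source model as a parameter-update vector relative to a shared base, project those updates onto a low-rank subspace they have in common, and drop coordinates where the models disagree. The function name, the SVD-based decomposition, and the sign-agreement filter below are all illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def shared_space_filter(deltas, k=1):
    """Hypothetical sketch of shared-space decomposition and filtering.

    `deltas` is a list of 1-D parameter updates (model_i minus base model),
    one per source model; `k` is the assumed rank of the shared subspace.
    """
    D = np.stack(deltas)                     # (num_models, num_params)
    # Shared-space decomposition: the top-k right singular vectors span
    # the update directions common to all source models.
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    shared = Vt[:k]                          # (k, num_params) basis
    projected = D @ shared.T @ shared        # each update, restricted to shared space
    # Filtering: zero out coordinates where projected updates disagree in
    # sign across models (a conflict heuristic, not the paper's exact rule).
    agree = np.all(np.sign(projected) == np.sign(projected[0]), axis=0)
    return projected.mean(axis=0) * agree    # merged, conflict-suppressed update
```

In this sketch, suppressing sign-conflicting coordinates is one plausible reading of "suppressing conflicting directions"; the actual criterion in the paper may operate on singular directions rather than individual parameters.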
- PivotMerge from CASIA and Microsoft merges multimodal models trained on different datasets without full retraining.
- It uses Shared-space Decomposition to isolate common alignment patterns and Filtering to suppress conflicting parameter updates.
- Alignment-guided Layer-wise Merging assigns per-layer weights based on each layer's contribution to cross-modal alignment.
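The per-layer weighting in the last bullet can be illustrated with a minimal sketch: given a scalar alignment-importance score for each layer of each source model (however those scores are measured), convert the scores into merge weights and take a weighted average of the layer parameters. The softmax weighting and the dictionary layout are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def alignment_guided_merge(layer_params, alignment_scores):
    """Hypothetical sketch of alignment-guided layer-wise merging.

    `layer_params[name]` is a list of per-model weight arrays for one layer;
    `alignment_scores[name]` is a matching list of scalar alignment scores
    (assumed to be measured externally, e.g. on a held-out alignment probe).
    """
    merged = {}
    for name, params in layer_params.items():
        scores = np.asarray(alignment_scores[name], dtype=float)
        weights = np.exp(scores) / np.exp(scores).sum()  # softmax over models
        # Higher-alignment models dominate this layer's merged parameters.
        merged[name] = sum(w * p for w, p in zip(weights, params))
    return merged
```

With equal scores this reduces to plain parameter averaging; skewed scores let a layer inherit mostly from whichever source model aligns modalities best at that depth.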
Why It Matters
Merging specialized AI models cheaply could accelerate multimodal AI development for vision-language tasks.