PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging
New technique combines vision-language models without retraining from scratch.
Researchers from the Institute of Automation, Chinese Academy of Sciences (CASIA) and Microsoft have introduced PivotMerge, a novel framework for merging multiple multimodal large language models (MLLMs) trained on different datasets into a single, more capable model. Published on arXiv, the work addresses a key limitation of existing model-merging techniques, which focus almost exclusively on the post-finetuning stage. PivotMerge instead targets the pre-training phase, where models learn cross-modal alignment—connecting visual and textual representations—from heterogeneous data sources. This alignment step is critical for building robust vision-language models like CLIP or GPT-4V.
PivotMerge tackles two core challenges: cross-domain parameter interference, where conflicting updates from different datasets degrade performance, and layer-wise alignment contribution disparity, where different neural network layers contribute unevenly to alignment. The framework introduces two key components: Shared-space Decomposition and Filtering, which isolates common alignment patterns while suppressing conflicting directions, and Alignment-guided Layer-wise Merging, which assigns custom weights to each layer based on its alignment importance. In extensive evaluations on CC12M-based multimodal benchmarks, PivotMerge consistently outperformed existing baselines, demonstrating both effectiveness and generalization across tasks like image captioning and visual question answering.
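The paper does not publish reference code here, but the Shared-space Decomposition and Filtering idea can be sketched roughly: treat each source model as a parameter-update vector relative to a shared base, project those updates onto a low-rank subspace they have in common, and drop coordinates where the models disagree. The function name, the SVD-based decomposition, and the sign-agreement filter below are all illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def shared_space_filter(deltas, k=1):
    """Hypothetical sketch of shared-space decomposition and filtering.

    `deltas` is a list of 1-D parameter updates (model_i minus base model),
    one per source model; `k` is the assumed rank of the shared subspace.
    """
    D = np.stack(deltas)                     # (num_models, num_params)
    # Shared-space decomposition: the top-k right singular vectors span
    # the update directions common to all source models.
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    shared = Vt[:k]                          # (k, num_params) basis
    projected = D @ shared.T @ shared        # each update, restricted to shared space
    # Filtering: zero out coordinates where projected updates disagree in
    # sign across models (a conflict heuristic, not the paper's exact rule).
    agree = np.all(np.sign(projected) == np.sign(projected[0]), axis=0)
    return projected.mean(axis=0) * agree    # merged, conflict-suppressed update
```

In this sketch, suppressing sign-conflicting coordinates is one plausible reading of "suppressing conflicting directions"; the actual criterion in the paper may operate on singular directions rather than individual parameters.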
- PivotMerge from CASIA and Microsoft merges multimodal models trained on different datasets without full retraining.
- It uses Shared-space Decomposition to isolate common alignment patterns and Filtering to suppress conflicting parameter updates.
- Alignment-guided Layer-wise Merging assigns per-layer weights based on each layer's contribution to cross-modal alignment.
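The per-layer weighting in the last bullet can be illustrated with a minimal sketch: given a scalar alignment-importance score for each layer of each source model (however those scores are measured), convert the scores into merge weights and take a weighted average of the layer parameters. The softmax weighting and the dictionary layout are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def alignment_guided_merge(layer_params, alignment_scores):
    """Hypothetical sketch of alignment-guided layer-wise merging.

    `layer_params[name]` is a list of per-model weight arrays for one layer;
    `alignment_scores[name]` is a matching list of scalar alignment scores
    (assumed to be measured externally, e.g. on a held-out alignment probe).
    """
    merged = {}
    for name, params in layer_params.items():
        scores = np.asarray(alignment_scores[name], dtype=float)
        weights = np.exp(scores) / np.exp(scores).sum()  # softmax over models
        # Higher-alignment models dominate this layer's merged parameters.
        merged[name] = sum(w * p for w, p in zip(weights, params))
    return merged
```

With equal scores this reduces to plain parameter averaging; skewed scores let a layer inherit mostly from whichever source model aligns modalities best at that depth.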
Why It Matters
Merging specialized AI models cheaply could accelerate multimodal AI development for vision-language tasks.