Revisiting Model Stitching in the Foundation Model Era
A new method reliably stitches together different vision foundation models such as CLIP and DINOv2, often producing hybrids that outperform both source models.
A research team led by Zheda Mai has published 'Revisiting Model Stitching in the Foundation Model Era,' accepted at CVPR 2023. The work systematically investigates whether modern, heterogeneous Vision Foundation Models (VFMs) such as CLIP, DINOv2, and SigLIP, which are trained on different data, with different objectives, and in some cases on different input modalities, can be effectively combined. The researchers introduce a rigorous protocol that tests a range of stitch points, layer types, training losses, and downstream tasks.
Their findings reveal that conventional stitching approaches often fail, but a simple feature-matching loss applied at the target model's penultimate layer makes these diverse VFMs reliably stitchable. Crucially, for deep stitch points, the resulting hybrid model can actually outperform both of its constituent models, adding only minimal computational overhead from a small 'stitch layer.' This challenges the prior understanding of stitching as merely a compatibility probe.
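To make the recipe concrete, here is a minimal PyTorch sketch of this kind of stitching: the early blocks of one VFM feed a small trainable stitch layer, whose output runs through the later blocks of a target VFM, and training matches the hybrid's penultimate-layer features to those of the intact target model. All names here (StitchedVFM, front_blocks, d_front, etc.) are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StitchedVFM(nn.Module):
    """Front VFM's early blocks -> small stitch layer -> target VFM's later blocks."""

    def __init__(self, front_blocks, target_blocks, k, d_front, d_target):
        super().__init__()
        self.front = nn.ModuleList(front_blocks[:k])   # early layers of the front VFM
        self.back = nn.ModuleList(target_blocks[k:])   # later layers of the target VFM
        self.stitch = nn.Linear(d_front, d_target)     # the only new, trainable module

    def forward(self, tokens):
        for blk in self.front:
            tokens = blk(tokens)
        tokens = self.stitch(tokens)                   # map front features into the target's space
        for blk in self.back[:-1]:
            tokens = blk(tokens)
        penultimate = tokens                           # features at the target's penultimate layer
        return self.back[-1](tokens), penultimate


def train_step(stitched, target_penultimate_fn, optimizer, x):
    """One step of stitch-layer training with a feature-matching loss (sketch)."""
    optimizer.zero_grad()
    _, penult = stitched(x)
    with torch.no_grad():
        ref = target_penultimate_fn(x)  # penultimate features of the intact target model
    loss = F.mse_loss(penult, ref)     # L2 matching; cosine or smooth-L1 are plausible variants
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, only the stitch layer receives gradients while both sets of pretrained blocks stay frozen, which is one plausible reading of why the hybrid adds so little overhead: at inference, the extra cost is a single linear projection.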
Building on this, the team proposes the VFM Stitch Tree (VST), a framework for sharing early layers across multiple VFMs while keeping their specialized later layers distinct. This architecture enables a controllable trade-off between accuracy and latency, which is particularly valuable for multimodal large language models that rely on multiple visual encoders. The study effectively elevates model stitching from an academic diagnostic to a practical engineering recipe for building more capable and efficient AI systems by integrating complementary model strengths.
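The tree structure can be pictured as one shared trunk of early layers feeding several stitched branches, each recovering one VFM's specialized later layers. The sketch below illustrates that idea under naming of our own (VFMStitchTree, trunk_blocks, branch_specs); it is an assumption-laden illustration, not the paper's implementation.

```python
import torch.nn as nn

class VFMStitchTree(nn.Module):
    """One shared trunk of early layers feeding several stitched VFM branches."""

    def __init__(self, trunk_blocks, branch_specs):
        # branch_specs: {name: (stitch_layer, later_blocks)}, one entry per VFM
        super().__init__()
        self.trunk = nn.Sequential(*trunk_blocks)
        self.stitches = nn.ModuleDict({n: s for n, (s, _) in branch_specs.items()})
        self.branches = nn.ModuleDict(
            {n: nn.Sequential(*blocks) for n, (_, blocks) in branch_specs.items()}
        )

    def forward(self, tokens):
        shared = self.trunk(tokens)  # early computation done once for all encoders
        return {
            name: branch(self.stitches[name](shared))
            for name, branch in self.branches.items()
        }
```

The depth at which the trunk ends is the control knob: a deeper shared trunk amortizes more computation across encoders, lowering latency when several encoders must run on the same image, while a shallower trunk leaves more layers per branch to preserve each VFM's specialized behavior.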
- A simple feature-matching loss enables reliable stitching of diverse Vision Foundation Models (CLIP, DINOv2, SigLIP), where previous methods failed.
- At deep stitch points, the combined model can exceed the performance of either original model with only a small inference cost for the stitch layer.
- The proposed VFM Stitch Tree (VST) allows sharing early layers across models, creating an accuracy-latency trade-off useful for multimodal AI systems.
Why It Matters
Provides a practical method for combining best-in-class AI components, potentially producing hybrid models that outperform either original without training from scratch.