MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation
New method matches the accuracy of leading approaches while training 99% fewer parameters.
A research team from UC Santa Barbara and UC Berkeley has introduced MMLoP (Multi-Modal Low-Rank Prompting), a breakthrough framework for efficiently adapting large vision-language models like CLIP to specific tasks. The innovation addresses a critical limitation in current prompt learning methods: while extending prompts to both vision and text encoders across multiple transformer layers significantly improves performance, it dramatically increases trainable parameters from thousands to millions, abandoning the parameter efficiency that makes prompt tuning attractive. MMLoP solves this by achieving deep multi-modal prompting with only 11.5K parameters—comparable to early text-only methods like CoOp—while maintaining competitive accuracy.
The method rests on three key components: a low-rank factorization that parameterizes the vision and text prompts at each transformer layer and serves as an implicit regularizer against overfitting; a self-regulating consistency loss that anchors prompted representations to frozen zero-shot CLIP features; and a shared up-projection that couples the vision and text prompts through a common low-rank factor to enforce cross-modal alignment. Extensive experiments across three benchmarks and 11 diverse datasets demonstrate that MMLoP achieves a highly favorable accuracy-efficiency tradeoff, outperforming the majority of existing methods, including those with orders of magnitude more parameters. This makes it practical to deploy customized vision-language AI in resource-constrained settings.
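To make the factorization concrete, here is a minimal PyTorch sketch of the idea: tiny per-layer, per-modality low-rank factors reconstructed through one shared up-projection, plus a consistency loss against frozen zero-shot features. The rank, prompt length, the assumption that both encoders take prompts of the same width, and the cosine form of the loss are illustrative choices, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankPrompts(nn.Module):
    """Per-layer vision/text prompts factorized through a shared up-projection."""
    def __init__(self, num_layers=12, prompt_len=2, rank=4, dim=512):
        super().__init__()
        # Modality-specific low-rank factors: one small matrix per layer.
        self.text_down = nn.Parameter(0.02 * torch.randn(num_layers, prompt_len, rank))
        self.vision_down = nn.Parameter(0.02 * torch.randn(num_layers, prompt_len, rank))
        # Shared up-projection: the common low-rank factor coupling both modalities.
        self.shared_up = nn.Parameter(0.02 * torch.randn(rank, dim))

    def forward(self):
        # Reconstruct full prompts of shape (num_layers, prompt_len, dim);
        # batched matmul broadcasts shared_up across all layers.
        return self.text_down @ self.shared_up, self.vision_down @ self.shared_up

def consistency_loss(prompted, zero_shot):
    """Anchor prompted features to frozen zero-shot CLIP features
    (one plausible form: mean cosine distance)."""
    return 1.0 - F.cosine_similarity(prompted, zero_shot, dim=-1).mean()

# Illustrative parameter arithmetic (not the paper's exact configuration):
#   full prompts:    2 modalities * 12 layers * 2 tokens * 512 dims  = 24,576
#   low-rank (r=4):  2 * 12 * 2 * 4 (down) + 4 * 512 (shared up)     =  2,240
prompts = LowRankPrompts()
print(sum(p.numel() for p in prompts.parameters()))  # -> 2240
```

Because the large rank-by-dim factor is shared, the per-layer cost reduces to the small down-projection matrices, which is what keeps the trainable budget near CoOp's scale even with prompts in every layer of both encoders; the paper's 11.5K figure depends on its specific rank and prompt configuration.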
- Achieves deep multi-modal prompting with only 11.5K trainable parameters (99% reduction vs. state-of-the-art)
- Maintains a 79.70% harmonic mean of base and novel accuracy on base-to-novel generalization across 11 diverse datasets
- Combines low-rank factorization, a zero-shot consistency anchor, and cross-modal alignment to curb overfitting on small datasets
Why It Matters
Enables efficient customization of vision-language AI for specialized applications without expensive retraining or massive parameter updates.