TWN adaptive reasoning cuts reasoning tokens by up to 50% and improves multimodal retrieval

arXiv cs.IR May 15, 2026

⚡Only 3-5% extra parameters, state-of-the-art on 78 MMEB-V2 tasks

Deep Dive

Multimodal large language models (MLLMs) are increasingly used to produce embeddings, and recent methods add chain-of-thought (CoT) reasoning to improve retrieval. However, these approaches are costly: they require separate reasoner and embedder models with substantial parameter counts, and they generate CoT indiscriminately, even for simple inputs where reasoning can degrade performance. To address this, Longxiang Zhang and colleagues propose Think When Needed (TWN), a unified framework that attaches lightweight reasoning and embedding adapters to a shared frozen backbone via a dual-LoRA architecture. Sharing the backbone keeps the total parameter count close to that of a single model, while detaching gradients at the adapter interface mitigates conflicts during joint optimization.
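To make the dual-LoRA idea concrete, here is a minimal pure-Python sketch of one frozen weight matrix shared by two low-rank adapters, plus the standard LoRA parameter-count arithmetic. The class name `DualLoRALinear`, the mode names `"reason"` and `"embed"`, and the layer size and ranks are all assumptions made for illustration; they are not taken from the TWN paper. The sketch covers only the forward pass and does not show training or the gradient detachment.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class DualLoRALinear:
    """One frozen weight shared by two low-rank adapters (hypothetical names)."""

    def __init__(self, weight: Matrix, adapters: Dict[str, Tuple[Matrix, Matrix]]):
        self.weight = weight      # frozen, shape d_out x d_in
        self.adapters = adapters  # mode -> (A: r x d_in, B: d_out x r)

    def forward(self, x: Vector, mode: str) -> Vector:
        base = matvec(self.weight, x)        # shared backbone output
        a, b = self.adapters[mode]           # pick the adapter for this mode
        delta = matvec(b, matvec(a, x))      # low-rank update B(Ax)
        return [y + d for y, d in zip(base, delta)]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one adapter: A has rank*d_in entries, B has d_out*rank."""
    return rank * (d_in + d_out)


# Toy 2x2 layer with rank-1 adapters, just to show the two modes differ.
layer = DualLoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    adapters={
        "reason": ([[1.0, 0.0]], [[0.0], [1.0]]),
        "embed": ([[0.0, 1.0]], [[1.0], [0.0]]),
    },
)
reason_out = layer.forward([2.0, 3.0], mode="reason")
embed_out = layer.forward([2.0, 3.0], mode="embed")

# Illustrative overhead for an assumed 4096x4096 layer with two rank-32 adapters.
extra = 2 * lora_param_count(4096, 4096, rank=32)
overhead = extra / (4096 * 4096)
```

Because each adapter stores only `rank * (d_in + d_out)` values rather than `d_in * d_out`, two adapters stay a small fraction of the frozen weight as long as the rank is small relative to the layer width.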

TWN's key innovation is an adaptive think mechanism: a self-supervised routing gate decides per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and improve retrieval quality. The authors further explore embedding-guided reinforcement learning to optimize CoT beyond supervised training. On the 78 tasks of the MMEB-V2 benchmark, TWN achieves state-of-the-art embedding quality while requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to full generative mode, demonstrating efficiency without sacrificing accuracy.
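The per-input routing described above can be sketched as a small gate that decides whether to generate CoT before embedding. This is a hedged illustration only: the logistic gate, the 0.5 threshold, the `route_and_embed` function, and the stub `reason` and `embed` callables are hypothetical stand-ins, and the self-supervised training of the gate is not shown.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RoutedResult:
    used_cot: bool
    reasoning_tokens: int
    embedding: List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def route_and_embed(
    features: Sequence[float],
    gate_weights: Sequence[float],
    gate_bias: float,
    reason: Callable[[Sequence[float]], List[str]],
    embed: Callable[[Sequence[float], List[str]], List[float]],
    threshold: float = 0.5,  # assumed cut-off, not from the paper
) -> RoutedResult:
    """Generate CoT only when the gate score reaches the threshold."""
    score = sigmoid(sum(w * f for w, f in zip(gate_weights, features)) + gate_bias)
    if score >= threshold:
        cot = reason(features)  # costly path: produce reasoning tokens first
        return RoutedResult(True, len(cot), embed(features, cot))
    return RoutedResult(False, 0, embed(features, []))  # direct embedding


def stub_reason(features: Sequence[float]) -> List[str]:
    """Stand-in for the reasoning adapter: always emits four tokens."""
    return ["step"] * 4


def stub_embed(features: Sequence[float], cot: List[str]) -> List[float]:
    """Stand-in for the embedding adapter: shifts features by CoT length."""
    return [f + 0.1 * len(cot) for f in features]


inputs = [[2.0, 0.0], [0.0, 2.0]]  # toy inputs: the gate thinks on the first only
results = [
    route_and_embed(x, [1.0, -1.0], 0.0, stub_reason, stub_embed) for x in inputs
]
total_reasoning_tokens = sum(r.reasoning_tokens for r in results)
```

In this toy run the second input skips reasoning entirely, so its reasoning-token cost is zero. The token counts come from the stubs and say nothing about the paper's measured savings.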

Key Points

Dual-LoRA architecture adds only 3-5% additional parameters to a frozen MLLM backbone
Adaptive think mechanism reduces reasoning tokens by up to 50% by skipping unnecessary CoT
Achieves state-of-the-art embedding quality on the 78-task MMEB-V2 benchmark

Why It Matters

Efficient adaptive reasoning in embeddings could unlock cheaper, faster retrieval for real-world multimodal applications.
