TWN adaptive reasoning cuts reasoning tokens by up to 50% and improves multimodal retrieval
Only 3-5% extra parameters, state-of-the-art across the 78-task MMEB-V2 benchmark
Multimodal large language models (MLLMs) are increasingly used for embeddings, and recent methods add chain-of-thought (CoT) reasoning to improve retrieval. However, these approaches are costly: they require separate reasoner and embedder models with substantial parameter counts, and they generate CoT indiscriminately, even for simple inputs where reasoning can degrade performance. To address this, Longxiang Zhang and colleagues propose Think When Needed (TWN), a unified framework that attaches lightweight reasoning and embedding adapters to a shared frozen backbone via a dual-LoRA architecture. Sharing the backbone keeps the parameter count close to that of a single model, and gradients are detached at the adapter interface to mitigate conflicts during joint optimization.
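To see why two LoRA adapters on a shared backbone stay small, a back-of-the-envelope count helps. The sketch below uses the standard LoRA arithmetic (an adapter of rank r on a d_out x d_in weight adds r * (d_in + d_out) parameters). The hidden size, layer count, and ranks are illustrative assumptions, not the configuration reported by the authors, and the backbone is simplified to square weight matrices only.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def dual_lora_overhead(hidden: int, n_layers: int, matrices_per_layer: int,
                       rank_reason: int, rank_embed: int) -> float:
    """Fraction of extra parameters from two adapters relative to the frozen backbone.

    Simplification (assumption): every adapted weight is hidden x hidden, and
    embeddings, norms, and vision towers are ignored.
    """
    backbone = n_layers * matrices_per_layer * hidden * hidden
    per_matrix = (lora_params(hidden, hidden, rank_reason)
                  + lora_params(hidden, hidden, rank_embed))
    added = n_layers * matrices_per_layer * per_matrix
    return added / backbone


# Hypothetical sizes chosen only for illustration.
overhead = dual_lora_overhead(hidden=4096, n_layers=32, matrices_per_layer=7,
                              rank_reason=64, rank_embed=32)
print(f"extra parameters: {overhead:.2%}")
```

Under these simplifications the ratio reduces to 2 * (rank_reason + rank_embed) / hidden, so the overhead depends on the adapter ranks and hidden size rather than on model depth.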
TWN's key innovation is an adaptive think mechanism: a self-supervised routing gate decides per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and improve retrieval quality. The authors further explore embedding-guided reinforcement learning to optimize CoT beyond supervised training. On the 78 tasks of the MMEB-V2 benchmark, TWN achieves state-of-the-art embedding quality while requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens than full generative mode, demonstrating efficiency without sacrificing accuracy.
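The control flow of per-input routing can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `adaptive_embed`, the threshold, and the toy `gate`, `reason`, and `embed` stand-ins are all hypothetical, with a word-count heuristic standing in for the learned self-supervised gate.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoutedResult:
    used_cot: bool
    reasoning_tokens: int
    embedding: List[float]


def adaptive_embed(query: str,
                   gate: Callable[[str], float],
                   reason: Callable[[str], str],
                   embed: Callable[[str], List[float]],
                   threshold: float = 0.5) -> RoutedResult:
    """Generate CoT only when the gate score reaches the threshold."""
    if gate(query) >= threshold:
        cot = reason(query)
        # Embed the query together with its reasoning trace.
        return RoutedResult(True, len(cot.split()), embed(query + " " + cot))
    # Simple input: skip reasoning and embed the query directly.
    return RoutedResult(False, 0, embed(query))


# Toy stand-ins (assumptions): a real system would use model calls here.
def toy_gate(q: str) -> float:
    return 1.0 if len(q.split()) > 5 else 0.0


def toy_reason(q: str) -> str:
    return "step one step two"


def toy_embed(text: str) -> List[float]:
    return [float(len(text))]


simple = adaptive_embed("red car", toy_gate, toy_reason, toy_embed)
hard = adaptive_embed("find the photo where the dog is left of the bench",
                      toy_gate, toy_reason, toy_embed)
print(simple.used_cot, simple.reasoning_tokens)
print(hard.used_cot, hard.reasoning_tokens)
```

Token savings in such a scheme come entirely from the inputs routed down the direct path, which spend zero reasoning tokens.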
- Dual-LoRA architecture adds only 3-5% additional parameters to a frozen MLLM backbone
- Adaptive think mechanism reduces reasoning tokens by up to 50% by skipping unnecessary CoT
- Achieves state-of-the-art embedding quality across the 78 tasks of the MMEB-V2 benchmark
Why It Matters
Efficient adaptive reasoning in embeddings could unlock cheaper, faster retrieval for real-world multimodal applications.