Open Source

Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

Finetuning Qwen3-VL-Embedding-2B boosts visual document retrieval by 6.6%, beating models 4x larger.

Deep Dive

The Sentence Transformers library, maintained by Tom Aarsen, has expanded its capabilities to include training and finetuning multimodal embedding and reranker models. This update lets developers customize models that process text, images, audio, and video for specialized applications like retrieval-augmented generation (RAG) and semantic search. The key advance is support for domain-specific finetuning, which dramatically improves performance on niche tasks where general-purpose models fall short.

In a concrete demonstration, Aarsen finetuned the Qwen/Qwen3-VL-Embedding-2B model for Visual Document Retrieval (VDR)—the task of finding relevant document pages (as images containing charts, tables, and layouts) based on text queries. The resulting custom model, tomaarsen/Qwen3-VL-Embedding-2B-vdr, achieved an NDCG@10 score of 0.947, a 6.6% relative improvement over the base model's 0.888. Remarkably, this finetuned 2B-parameter model outperformed all existing VDR models tested, including some four times larger, showing that targeted training can beat raw scale in specialized applications.
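For readers unfamiliar with the metric: NDCG@10 rewards ranking the relevant page near the top of the first ten results. A quick self-contained illustration of how it behaves (the relevance lists here are made up for demonstration):

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: each result's relevance is discounted
    # by log2 of its (1-indexed) rank plus one.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal (best possible) ordering,
    # so a perfect ranking scores 1.0.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Binary relevance: the single correct page retrieved at rank 1 vs. rank 2.
print(ndcg_at_k([1, 0, 0, 0, 0]))            # → 1.0
print(round(ndcg_at_k([0, 1, 0, 0, 0]), 3))  # → 0.631
```

An average NDCG@10 of 0.947 across queries therefore means relevant pages land at or very near the top of the ranking almost every time.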

The training pipeline maintains the same components as text-only Sentence Transformers training—Model, Dataset, Loss Function, Training Arguments, Evaluator, and Trainer—but now handles multimodal data automatically. The Transformer module detects supported modalities from the model's processor, and developers can control preprocessing parameters like image resolution bounds through processor_kwargs. This approach enables organizations to create highly specialized AI systems for document understanding, e-commerce product matching, or multimedia content retrieval without needing massive computational resources.

Key Points
  • Finetuning Qwen3-VL-Embedding-2B improved visual document retrieval performance by 6.6% (NDCG@10 from 0.888 to 0.947)
  • The 2B-parameter finetuned model outperformed all tested VDR competitors, including models with 4x more parameters
  • Sentence Transformers now supports training pipelines for multimodal models handling text, images, audio, and video
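At inference time, a finetuned checkpoint like the one above plugs into the usual encode-and-rank flow. A minimal sketch, assuming the page image files exist locally and that the installed Sentence Transformers version routes PIL images through the model's processor (imports are kept inside the function so the sketch only runs when called):

```python
def rank_pages_sketch(query: str, page_paths: list[str]) -> list[str]:
    # Requires sentence-transformers (with multimodal support) and pillow.
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("tomaarsen/Qwen3-VL-Embedding-2B-vdr")

    # Embed the text query and the document-page images into the same space.
    query_emb = model.encode([query])
    page_embs = model.encode([Image.open(p) for p in page_paths])

    # Score each page against the query and return paths ranked best-first.
    scores = model.similarity(query_emb, page_embs)[0]
    order = scores.argsort(descending=True).tolist()
    return [page_paths[i] for i in order]
```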

Why It Matters

Enables organizations to build specialized AI retrieval systems that outperform larger general models, reducing computational costs.