Multimodal Embedding & Reranker Models with Sentence Transformers
The Python library now lets you encode and compare text, images, audio, and video in a shared vector space.
The Sentence Transformers Python library, maintained by Tom Aarsen, has released version 5.4, adding multimodal capabilities. Previously focused on text embeddings, the library now supports encoding and comparing text, images, audio, and video through the same familiar API. The new capability is powered by multimodal embedding models such as Qwen/Qwen3-VL-Embedding-2B, which map the different modalities into a shared 2048-dimensional vector space. The release also introduces multimodal reranker models that score relevance between mixed-modality pairs, such as a text query against image documents.
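In practice, that shared space means text and images can be embedded and compared in a few lines of code. The sketch below assumes the multimodal checkpoint loads through the usual SentenceTransformer class and that encode() accepts image paths or URLs directly, as described in the release; the file names are placeholders.

```python
from sentence_transformers import SentenceTransformer

# Load a multimodal embedding model that maps text and images
# into the same 2048-dimensional vector space.
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

# Text inputs are encoded exactly as before.
text_embeddings = model.encode([
    "a cat sleeping on a windowsill",
    "a crowded city street at night",
])

# Image inputs: local file paths or URLs (placeholder files shown here).
image_embeddings = model.encode([
    "photos/cat.jpg",
    "https://example.com/street.jpg",
])

# Because both modalities share one vector space, they compare directly.
similarities = model.similarity(text_embeddings, image_embeddings)
print(similarities)  # 2x2 matrix of cosine similarities
```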
Developers can install the new features with pip install commands that specify the needed modalities (image, audio, or video). The library detects modality support automatically, so existing workflows need minimal code changes: model.encode() now accepts image URLs, local file paths, or PIL Image objects alongside text. This enables practical applications such as cross-modal similarity search, where text queries are matched against an image database, and multimodal RAG pipelines that combine text and visual information. The models demand significant GPU resources (8-20 GB of VRAM); text-only or CLIP models are recommended for CPU inference.
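A rough sketch of the reranking step in such a pipeline follows. The pip extras name, the reranker model id, and the assumption that CrossEncoder.predict() accepts image paths as documents are illustrative rather than confirmed API details.

```python
# pip install "sentence-transformers[image]"   # extras name assumed; check the docs
from sentence_transformers import CrossEncoder

# Hypothetical multimodal reranker checkpoint.
reranker = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B")

query = "hand-drawn diagram of a transformer architecture"
candidate_images = ["figs/transformer.png", "figs/cnn.png", "figs/rnn.png"]

# Score each (text query, image document) pair; higher means more relevant.
scores = reranker.predict([(query, image) for image in candidate_images])

# Sort candidates by relevance, e.g. before handing the top hit to a
# vision-language model in a multimodal RAG pipeline.
ranked = sorted(zip(candidate_images, scores), key=lambda pair: pair[1], reverse=True)
for image, score in ranked:
    print(f"{score:.3f}  {image}")
```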
- Version 5.4 adds multimodal embeddings for text, images, audio, and video in a shared vector space
- Supports models such as Qwen3-VL-Embedding-2B, which produce 2048-dimensional embeddings and need 8-20 GB of VRAM
- Enables cross-modal search, visual document retrieval, and multimodal RAG pipelines
Why It Matters
Developers can now build AI applications that understand and connect information across text, images, audio, and video simultaneously.