Research & Papers

IBM's Granite Embedding R2 models handle 200+ languages and code

New open-source embedding models support a 32K-token context, 64x larger than the R1 generation.

Deep Dive

IBM Research has introduced the Granite Embedding Multilingual R2 models, a new family of encoder-based embedding models designed for enterprise-scale dense retrieval across more than 200 languages. The release extends the earlier English-focused R2 models with enhanced support for 52 natural languages and programming code, plus a 32,768-token context window, a 64x expansion over the previous R1 generation. The models are built on the ModernBERT architecture with an expanded multilingual vocabulary and are reported to reach state-of-the-art performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval tasks.

The model family includes two bi-encoder variants: a 311M-parameter full-size model and a 97M-parameter compact model created through model pruning and vocabulary selection. The compact version achieves the highest retrieval score of any open multilingual embedding model under 100M parameters, which makes it a good fit for resource-constrained environments. The full-size model supports Matryoshka Representation Learning, so users can shorten embeddings to fewer dimensions without retraining. Both models are released under the Apache 2.0 license and trained on enterprise-appropriate data with governance oversight, which suits them to both research and commercial use.
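As a rough illustration of what adjusting embedding dimensionality means in practice, the sketch below shows the usual way Matryoshka-style embeddings are shortened: keep the leading components and re-normalize. The function names and the 8-dimensional toy vectors are hypothetical stand-ins, not output from the Granite models, and the snippet does not call any IBM API.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to normalize
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


# Toy vectors standing in for real model output (assumption, not real data).
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, -0.1, 0.02]
doc = [0.8, 0.2, 0.25, -0.1, -0.3, 0.4, 0.1, -0.05]

full_score = cosine(query, doc)
short_score = cosine(truncate_embedding(query, 4), truncate_embedding(doc, 4))
print(f"full: {full_score:.3f}  truncated to 4 dims: {short_score:.3f}")
```

Shorter vectors cut index size and similarity-search cost roughly in proportion to the dimension, usually at some cost in retrieval quality; how much quality is lost depends on the model and the dimension chosen.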

Key Points
  • Supports 200+ languages with enhanced coverage for 52 languages and programming code.
  • 32,768-token context window, a 64x increase over the R1 generation.
  • 97M-parameter compact model achieves the highest retrieval score among open sub-100M multilingual embedding models.

Why It Matters

Enterprise teams gain Apache 2.0-licensed multilingual embedding models for search, RAG, and code retrieval at scale, including a sub-100M option for resource-constrained deployments.