Developer Tools

llama.cpp adds support for IBM Granite multilingual embeddings (97M/311M)

Open-source LLM engine now runs IBM's Granite multilingual embedding models locally.

Deep Dive

The latest llama.cpp release (b9481) adds support for IBM's Granite Embedding Multilingual R2 models in their 97M- and 311M-parameter variants. For the 97M model, the team added a variant of the GPT-4o tokenizer with a corrected regex that handles diacritical marks better, along with different token-merging settings. The 311M variant reuses the Gemma4 tokenizer. Both models use a SwiGLU feed-forward network (FFN), which required a new GGUF key for .hidden_activation and a centralized ffn_op_type mapping. The release also includes converter code and model hashes.
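As a rough illustration of what a SwiGLU FFN computes, and of how a stored activation name might select the operation, here is a minimal pure-Python sketch. It is not llama.cpp's implementation: the names FFN_ACTIVATIONS, gated_ffn, and matvec are hypothetical, and the lookup table only mimics the idea of a centralized mapping from a hidden-activation string to an FFN op.

```python
import math

def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)

def gelu(x: float) -> float:
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Hypothetical central table: activation name (as it might be stored in
# model metadata) -> elementwise function applied to the gate branch.
FFN_ACTIVATIONS = {"silu": silu, "gelu": gelu}

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def gated_ffn(x, w_gate, w_up, w_down, hidden_activation="silu"):
    """Gated FFN: down(act(gate(x)) * up(x)). With "silu" this is SwiGLU."""
    act = FFN_ACTIVATIONS[hidden_activation]  # KeyError on unknown names
    gate = [act(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```

Keeping the name-to-function lookup in one table means a new activation only has to be registered once, which is the general motivation for a centralized mapping.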

This update matters for developers deploying multilingual embeddings in resource-constrained environments. With SwiGLU FFN support and a dedicated hidden-activation key, llama.cpp extends its coverage to embedding models that use this gated FFN design. Users can now run IBM's Granite embedding models locally without specialized hardware, for applications such as retrieval-augmented generation (RAG), semantic search, and multilingual NLP pipelines. Shipping both 97M and 311M sizes lets users trade speed against accuracy: the smaller model typically for lower latency and memory use, the larger where retrieval quality matters more.
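To make the semantic-search use case concrete, the sketch below ranks documents by cosine similarity between embedding vectors. It assumes the vectors have already been produced by an embedding model; the toy two-dimensional vectors and the function names cosine_similarity and rank_documents are illustrative only and are not part of llama.cpp.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real model output.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = rank_documents([1.0, 0.0], docs)
```

In a RAG pipeline, the top-ranked documents would then be passed to a generator model as context.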

Key Points
  • Support for two Granite Embedding Multilingual R2 models: 97M and 311M parameters
  • GPT-4o tokenizer variant with a corrected regex for better diacritical-mark handling (97M); the 311M model reuses the Gemma4 tokenizer
  • Adds SwiGLU FFN support with new .hidden_activation GGUF key and centralized ffn_op_type mapping

Why It Matters

Local, efficient multilingual embeddings for RAG and search are now accessible on CPU via llama.cpp.