Hugging Face Transformers v5.4 adds Mistral 4, Jina v3, and 5x faster video AI

HF Transformers March 27, 2026

⚡The update integrates seven major new models, including a 119B parameter MoE and a robot action model.

Deep Dive

Hugging Face has released version 5.4.0 of its Transformers library, a major release that integrates seven new models from several research teams. The headliner is Mistral AI's 'Mistral 4', a 119-billion-parameter Mixture-of-Experts (MoE) model with 128 experts that combines instruction following, reasoning, and coding in a single architecture. It offers a 256k context window and accepts both image and text input. Alongside it, Jina AI's 'Jina-Embeddings-v3' is a multilingual text embedding model based on XLM-RoBERTa. It supports sequences of up to 8192 tokens and ships with five task-specific LoRA adapters, covering tasks such as retrieval and classification, that can be selected per task without a latency penalty.
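To make the idea of task-specific LoRA adapters concrete, here is a minimal sketch in plain Python. It is not Jina AI's implementation or the Transformers API: the weights, the rank-1 shapes, the `scale` parameter, and the two adapter names are all hypothetical and chosen only for illustration. It shows one shared base weight plus a small low-rank pair per task, so that changing task means picking a different pair rather than loading a different model.

```python
# Toy illustration of task-specific LoRA adapters on one shared linear layer.
# All names, shapes, and values are hypothetical; this is not Jina AI's code.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Shared base weight (3 outputs, 3 inputs), identical for every task.
BASE_WEIGHT = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# One low-rank pair per task: A projects down to rank 1, B projects back up.
ADAPTERS = {
    "retrieval": {"A": [[1.0, 0.0, 0.0]], "B": [[0.5], [0.0], [0.0]]},
    "classification": {"A": [[0.0, 1.0, 0.0]], "B": [[0.0], [0.0], [2.0]]},
}

def forward(x, task, scale=1.0):
    """Compute W x + scale * B_task (A_task x) for the selected task."""
    adapter = ADAPTERS[task]
    base_out = matvec(BASE_WEIGHT, x)
    low_rank = matvec(adapter["B"], matvec(adapter["A"], x))
    return [b + scale * d for b, d in zip(base_out, low_rank)]

x = [2.0, 3.0, 4.0]
retrieval_out = forward(x, "retrieval")            # uses the retrieval pair
classification_out = forward(x, "classification")  # same base, different pair
```

The same input produces different outputs depending on the adapter selected, while `BASE_WEIGHT` is never modified. How the real model keeps this switch free of latency cost is not described in the release summary, so the sketch makes no claim about it.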

The release also expands the library's computer vision and robotics coverage. 'VidEoMT', a model for online video segmentation, runs 5x to 10x faster than previous methods, at up to 160 FPS, by using a query propagation mechanism. For robotics, 'PI0' is a vision-language-action model that uses flow matching to generate robot actions for complex tasks such as laundry folding and box assembly. The update also includes 'UVDoc' for document image correction and Baidu's 'SLANeXt' series for table structure recognition.
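For readers unfamiliar with flow matching, the sketch below shows the sampling side of the idea in plain Python: start from random noise and integrate a velocity field over time until an action vector comes out. It is a toy under stated assumptions, not PI0's code. In a real model the velocity would be predicted by a network conditioned on camera images and a language instruction; here the hypothetical `toy_velocity` function is simply handed the target action so the example is self-contained.

```python
# Toy flow-matching sampler: integrate a velocity field from noise to an action.
# The velocity function stands in for a learned network and is hypothetical.
import random

def toy_velocity(x, t, target):
    """Velocity of the straight-line path that reaches `target` at t = 1."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample_action(target, num_steps=10, seed=0):
    """Euler-integrate dx/dt = v(x, t) from t = 0 (noise) toward t = 1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t stays below 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical 3-dimensional action, e.g. three joint targets.
action = sample_action(target=[0.2, -0.5, 1.0])
```

With this particular velocity field the Euler steps land on the target up to floating-point error, which makes the mechanics easy to follow. A learned field would only approximate such a path.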

Key Points

Mistral 4: A unified 119B-parameter MoE model with 128 experts, a 256k context window, and multimodal (image-text) capabilities.
Jina Embeddings v3: Multilingual embedding model with an 8192-token context and five task-specific LoRA adapters that add no latency penalty.
VidEoMT: Online video segmentation model that is 5-10x faster than previous methods, running at up to 160 FPS with a lightweight architecture.

Why It Matters

This release makes several frontier models available through the Transformers library, so developers can build multimodal and robotics applications with a common set of tools.
