feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp
A 310B-parameter model with only 15B active per token and full multimodal input.
Deep Dive
Xiaomi's MiMo V2.5 is a sparse Mixture-of-Experts (MoE) model with 310B total parameters, of which only 15B are activated per token. It supports up to 1M tokens of context and accepts text, image, video, and audio input, with dedicated encoders for vision (a 729M-parameter ViT) and audio (a 261M-parameter transformer), plus a 329M-parameter multi-token prediction module.
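For readers new to sparse MoE: the per-token sparsity comes from a router that scores every expert for each token but dispatches the token to only the top-k of them, so compute scales with k rather than with the total expert count. The sketch below illustrates top-k routing in C++ with made-up sizes (8 experts, k = 2); it is not MiMo's actual router or llama.cpp's implementation.

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Pick the k highest-scoring experts for one token. Only these experts
// run; the rest of the model's parameters stay idle for this token.
std::vector<int> top_k_experts(const std::vector<float> & router_logits, int k) {
    std::vector<int> idx(router_logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Partially sort so the k best expert indices come first.
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return router_logits[a] > router_logits[b]; });
    idx.resize(k);
    return idx;
}

int main() {
    // Hypothetical router output for one token over 8 experts.
    std::vector<float> logits = {0.1f, 2.3f, -0.5f, 1.7f, 0.0f, 3.1f, -1.2f, 0.9f};
    for (int e : top_k_experts(logits, /*k=*/2)) {
        std::printf("run expert %d (score %.1f)\n", e, logits[e]);
    }
    return 0;
}
```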
Key Points
- 310B total / 15B activated parameters: roughly 5% of the model runs per token, so per-token compute is close to that of a 15B dense model (see the back-of-envelope sketch after this list).
- Supports up to 1M tokens of context, enabling analysis of very long documents or videos.
- Includes dedicated vision (729M ViT) and audio (261M transformer) encoders for full multimodal input.
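As a rough illustration of the efficiency claim, the snippet below applies the common rule of thumb of ~2 FLOPs per parameter per forward pass; it is a back-of-envelope estimate from the stated parameter counts, not a published benchmark.

```cpp
#include <cstdio>

// Compare per-token compute for MiMo's sparse activation (15B active)
// against a hypothetical dense model with the same 310B total parameters.
int main() {
    const double total_params  = 310e9; // all experts combined
    const double active_params = 15e9;  // parameters actually run per token
    const double flops_sparse  = 2.0 * active_params; // ~2 FLOPs/param rule of thumb
    const double flops_dense   = 2.0 * total_params;
    std::printf("sparse: %.0f GFLOPs/token\n", flops_sparse / 1e9); // ~30
    std::printf("dense : %.0f GFLOPs/token\n", flops_dense  / 1e9); // ~620
    std::printf("ratio : %.1fx cheaper per token\n", flops_dense / flops_sparse); // ~20.7x
    return 0;
}
```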
Why It Matters
Brings an enterprise-grade multimodal MoE with ultra-long context to local inference, making this class of model runnable on self-hosted hardware.