b9093
Open-source LLM runner now supports Sarvam's Mixture of Experts models locally.
The llama.cpp project, known for its efficient C++ implementation of large language model inference, has tagged version b9093. This release introduces support for the Sarvam MoE architecture, a mixture-of-experts model design developed by Sarvam AI. Mixture-of-experts models route each token to a small subset of specialized subnetworks ("experts"), so only a fraction of the parameters is active per step; this allows much larger total parameter counts at a lower inference cost. By adding this architecture, llama.cpp broadens the range of models that can run locally on consumer hardware.
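To make the routing idea concrete, here is a minimal, self-contained C++ sketch of top-k expert selection: a gating softmax scores the experts for a token, only the highest-probability ones are evaluated, and their outputs are combined by weight. This is a conceptual illustration, not llama.cpp's actual MoE code; the expert count, the top-k value, and the toy `expert_ffn` transform are all hypothetical.

```cpp
// Conceptual sketch of top-k mixture-of-experts routing.
// Not llama.cpp's implementation; all names and sizes are illustrative.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

// One "expert": a tiny placeholder transform on the token's hidden state.
static std::vector<float> expert_ffn(int expert_id, const std::vector<float> &x) {
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * (1.0f + 0.1f * expert_id);  // placeholder computation
    }
    return y;
}

int main() {
    const int n_expert      = 8;  // total experts in the layer
    const int n_expert_used = 2;  // experts activated per token (top-k)

    std::vector<float> token       = {0.3f, -1.2f, 0.7f, 0.05f};  // toy hidden state
    std::vector<float> gate_logits = {0.1f, 2.3f, -0.4f, 1.7f, 0.0f, -1.1f, 0.9f, 0.2f};

    // Softmax over the gate logits to get routing probabilities.
    float max_logit = *std::max_element(gate_logits.begin(), gate_logits.end());
    std::vector<float> probs(n_expert);
    float sum = 0.0f;
    for (int e = 0; e < n_expert; ++e) {
        probs[e] = std::exp(gate_logits[e] - max_logit);
        sum += probs[e];
    }
    for (float &p : probs) p /= sum;

    // Select the top-k experts by routing probability.
    std::vector<int> order(n_expert);
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + n_expert_used, order.end(),
                      [&](int a, int b) { return probs[a] > probs[b]; });

    // Evaluate only the selected experts and combine their outputs by weight.
    std::vector<float> out(token.size(), 0.0f);
    for (int k = 0; k < n_expert_used; ++k) {
        int e = order[k];
        std::vector<float> y = expert_ffn(e, token);
        for (size_t i = 0; i < token.size(); ++i) out[i] += probs[e] * y[i];
        std::printf("expert %d selected with weight %.3f\n", e, probs[e]);
    }
    return 0;
}
```

In a real MoE layer the gate logits come from a learned projection of the hidden state rather than fixed values, but the selection-and-combine step works the same way: most experts are skipped entirely for any given token.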
The b9093 release is accompanied by extensive cross-platform builds. Precompiled binaries are available for macOS (Apple Silicon and Intel) and for Linux (x64, arm64, s390x) with a range of acceleration backends, including Vulkan, ROCm, OpenVINO, and SYCL. Windows users get CPU, CUDA, and Vulkan builds, and Android and iOS are covered as well. This lets developers and researchers deploy Sarvam MoE models across desktop, server, and mobile environments without compiling from source.
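For developers who embed the library rather than use the prebuilt tools, a minimal load-and-teardown sketch against llama.cpp's C API (llama.h) might look like the following. The model filename is hypothetical, and the function names reflect recent releases of the library, so they may differ in older versions.

```cpp
// Minimal sketch: load a local GGUF model with llama.cpp's C API.
// The model path is hypothetical; API names follow recent llama.cpp releases.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;  // offload layers if a GPU backend is compiled in

    // Hypothetical file name for a Sarvam MoE GGUF conversion.
    llama_model * model = llama_model_load_from_file("sarvam-moe.gguf", mparams);
    if (!model) {
        std::fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 4096;  // context window for this session

    llama_context * ctx = llama_init_from_model(model, cparams);
    if (!ctx) {
        std::fprintf(stderr, "failed to create context\n");
        llama_model_free(model);
        return 1;
    }

    // ... tokenize, decode, and sample here ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

The same GGUF file can be used across the platforms listed above; what changes between the prebuilt binaries is the acceleration backend compiled into them.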
This update is significant for the open-source AI community. With over 109k stars and 18k forks on GitHub, llama.cpp plays a dominant role in local LLM deployment. The addition of Sarvam MoE follows recent support for other architectures such as Falcon, MPT, and DBRX. In practice, users can now experiment with Sarvam's Hindi-English bilingual models and other MoE variants directly on their own machines, reducing reliance on cloud APIs and improving privacy.
- Version b9093 adds support for the Sarvam MoE (Mixture of Experts) architecture.
- Pre-built binaries are available for macOS (Apple Silicon and Intel), Windows (CPU/CUDA/Vulkan), Linux, Android, and iOS.
- llama.cpp is the most-starred LLM inference library, with over 109k GitHub stars and 18k forks.
Why It Matters
Expands local AI inference to new MoE models, giving developers more flexibility and privacy.