llama.cpp adds Mellum architecture support in release b9482
Open-source LLM runtime now runs Mellum models across CPU, GPU, and mobile.
llama.cpp, the widely used open-source C/C++ inference engine for large language models, has published release b9482, which adds official support for the Mellum architecture. Mellum is a new model architecture; the release notes do not detail its design or performance characteristics. The addition makes llama.cpp one of the first runtimes to support Mellum natively, so developers and enthusiasts can experiment with these models on their own hardware.
The release matters for local AI deployment. llama.cpp is known for running models on consumer-grade CPUs and GPUs without cloud infrastructure. With Mellum support, users can compile and run the new models on macOS Apple Silicon (including KleidiAI-accelerated builds), Linux (x64, arm64, s390x), Windows (CPU, CUDA 12/13, Vulkan, HIP), and Android arm64. The release also includes infrastructure updates: transformers is downgraded to 4.57.6 to fix CI, and the huggingface_hub dependency is removed, which streamlines the build process.
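As a rough illustration of what running a model locally looks like, the sketch below assembles a one-shot `llama-cli` invocation from Python. It assumes a llama.cpp build with `llama-cli` on the PATH and a Mellum model already converted to GGUF; the file name `mellum.gguf` and the helper `build_llama_cli_command` are hypothetical placeholders, not names from the release notes. Only the long-standing `-m` (model), `-p` (prompt), and `-n` (tokens to generate) options are used.

```python
import shlex


def build_llama_cli_command(model_path, prompt, n_predict=64, binary="llama-cli"):
    """Assemble an argv list for a single llama-cli generation run.

    model_path is a placeholder: point it at whatever GGUF file you have.
    """
    return [
        binary,
        "-m", str(model_path),   # path to the GGUF model file
        "-p", prompt,            # prompt text
        "-n", str(n_predict),    # number of tokens to generate
    ]


if __name__ == "__main__":
    # "mellum.gguf" is a hypothetical file name used only for illustration.
    cmd = build_llama_cli_command("mellum.gguf", "Hello", n_predict=8)
    # Print a shell-safe version of the command for inspection or copy-paste.
    print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run` as-is; building it as a list rather than a string avoids shell-quoting problems with prompts that contain spaces or quotes.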
- llama.cpp release b9482 adds support for the new Mellum model architecture.
- Supports macOS (Apple Silicon & Intel), Linux, Windows (CPU, CUDA 12/13, Vulkan, HIP), and Android arm64.
- The release is signed with GitHub's verified GPG key and includes CI fixes and dependency updates.
Why It Matters
Mellum models can now run locally on consumer hardware with CPU or GPU acceleration.