llama.cpp b9244 brings MoE support to Adreno GPUs via OpenCL
MoE models now run on mobile GPUs with 4-6 bit quantization.
The open-source llama.cpp project, which has more than 112k GitHub stars and 18.5k forks, released build b9244 on May 20. The release adds Mixture of Experts (MoE) support for three quantization formats (q4_k, q5_k, and q6_k) on Qualcomm Adreno GPUs through the OpenCL backend. The implementation was contributed by Li He of Qualcomm, which points to direct vendor involvement in optimizing local inference on mobile hardware.
MoE models such as Mixtral 8x7B route each token through only a few of their "expert" sub-networks (two of eight in Mixtral's case), so per-token compute is far lower than in a dense model with the same total parameter count, while quality stays competitive. All experts must still be held in memory, which is where quantization helps. With this update, llama.cpp can run such models on Adreno GPUs, which ship in the Qualcomm Snapdragon chips used by many flagship Android devices. The q4_k, q5_k, and q6_k formats store weights at roughly 4 to 6 bits each (slightly more once per-block scales are counted), letting users trade precision for speed and memory. Combined with OpenCL's parallel execution on the GPU, this brings quantized MoE models within reach of smartphones and tablets.
The release also includes the usual cross-platform builds: macOS (Apple Silicon, Intel, iOS), Linux (x86, ARM, s390x with Vulkan, ROCm, OpenVINO, SYCL), Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android (arm64 CPU), and openEuler. The update strengthens llama.cpp's position as a tool for running MoE models offline on consumer devices.
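To make the per-token expert selection concrete, here is a minimal pure-Python sketch of top-k routing. It is not llama.cpp's OpenCL code: the names `route_top_k` and `moe_forward`, the toy experts, and the choice to apply softmax over only the selected experts' scores are illustrative assumptions that follow one common formulation of MoE gating.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and normalize their weights.

    Hypothetical helper: returns a list of (expert_index, weight) pairs.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    out = [0.0] * len(x)
    for idx, w in route_top_k(router_logits, k):
        y = experts[idx](x)  # the other experts are never evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy setup (assumption): expert i simply scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * t for t in v] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

selected = route_top_k(logits, k=2)
output = moe_forward([1.0, -1.0], logits, experts, k=2)
print(selected)
print(output)
```

With eight experts and k=2, only a quarter of the expert computation runs for each token, which is the source of the compute savings described above.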
- Adds MoE support for q4_k, q5_k, q6_k quantizations on Adreno GPUs via OpenCL
- Contributed by Qualcomm engineer Li He, enabling mobile-friendly local inference
- llama.cpp now supports MoE on Android devices running Adreno GPUs, expanding edge AI capabilities
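As a rough guide to what the three formats cost in memory, the sketch below derives effective bits per weight from the K-quant block layout in ggml (256 weights per block, occupying 144, 176, and 210 bytes for q4_k, q5_k, and q6_k). The helper names and the 10-billion-parameter model are hypothetical, and real GGUF files usually mix quantization types across tensors, so actual file sizes will differ.

```python
# Bytes per 256-weight block for each K-quant format (ggml block layout).
BLOCK_WEIGHTS = 256
BLOCK_BYTES = {"q4_k": 144, "q5_k": 176, "q6_k": 210}

def bits_per_weight(fmt):
    """Effective bits per weight, including per-block scale metadata."""
    return BLOCK_BYTES[fmt] * 8 / BLOCK_WEIGHTS

def weight_gib(n_params, fmt):
    """Approximate weight storage in GiB if every tensor used `fmt`."""
    return n_params * bits_per_weight(fmt) / 8 / 2**30

# Hypothetical model size, chosen only for illustration.
N_PARAMS = 10e9

for fmt in BLOCK_BYTES:
    print(f"{fmt}: {bits_per_weight(fmt):.4f} bits/weight, "
          f"~{weight_gib(N_PARAMS, fmt):.1f} GiB for {N_PARAMS:.0e} params")
```

For an MoE model, `n_params` should be the total parameter count across all experts, because every expert has to be resident even though only a few run per token.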
Why It Matters
MoE models on mobile hardware mean powerful AI assistants can run fully offline.