llama.cpp b9244 brings MoE support to Adreno GPUs via OpenCL

llama.cpp Releases May 20, 2026

⚡ MoE models now run on Adreno mobile GPUs with 4- to 6-bit quantization.

Deep Dive

The open-source llama.cpp project, which has more than 112k GitHub stars and 18.5k forks, published release b9244 on May 20, 2026. The release adds Mixture of Experts (MoE) support for three k-quant formats (q4_k, q5_k, and q6_k) in the OpenCL backend, targeting Qualcomm Adreno GPUs. The implementation was contributed by Li He of Qualcomm, which points to the GPU vendor itself helping to optimize local inference on mobile hardware.
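For a rough sense of what the three formats cost in memory, the sketch below derives bits per weight from the ggml k-quant block layouts (each super-block packs 256 weights together with its scales) and applies them to a model size. The 46.7B total parameter count for Mixtral 8x7B and the helper names `bits_per_weight` and `weight_gib` are assumptions made for illustration. Real GGUF files usually mix tensor types, so treat the result as a ballpark figure rather than an exact file size.

```python
# Bytes per super-block of 256 weights in the ggml k-quant layouts
# (quantized values plus per-block scales and minimums).
BLOCK_WEIGHTS = 256
BLOCK_BYTES = {"q4_k": 144, "q5_k": 176, "q6_k": 210}


def bits_per_weight(fmt: str) -> float:
    """Effective storage cost of one weight, including block metadata."""
    return BLOCK_BYTES[fmt] * 8 / BLOCK_WEIGHTS


def weight_gib(n_params: float, fmt: str) -> float:
    """Approximate size in GiB if every weight used the same format."""
    return n_params * bits_per_weight(fmt) / 8 / 2**30


if __name__ == "__main__":
    # Assumption: about 46.7e9 total parameters (Mixtral 8x7B).
    n_params = 46.7e9
    for fmt in BLOCK_BYTES:
        print(f"{fmt}: {bits_per_weight(fmt):.4f} bits/weight, "
              f"~{weight_gib(n_params, fmt):.1f} GiB")
```

The estimate covers weights only. The KV cache and activation buffers come on top, which is why device memory tends to be the limiting factor on phones.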

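MoE models such as Mixtral 8x7B contain several "expert" sub-networks per layer, and a router activates only a few of them for each token (two of eight in Mixtral's case). This cuts the compute per token well below that of a dense model with the same total parameter count, although every expert's weights must still be held in memory. With this update, llama.cpp can run these models through its OpenCL backend on Adreno GPUs, which ship in the Qualcomm Snapdragon chips used by many flagship Android devices. The names q4_k, q5_k, and q6_k refer to nominal 4-, 5-, and 6-bit weight formats; once per-block scales are included, they cost roughly 4.5, 5.5, and 6.56 bits per weight, so users can trade precision against speed and memory. Combined with OpenCL's parallel execution on the GPU, this brings quantized MoE models within reach of smartphones and tablets, provided the device has enough memory to hold the weights.

The release also includes the usual cross-platform builds: Apple platforms (macOS on Apple Silicon and Intel, iOS), Linux (x86, ARM, and s390x, plus Vulkan, ROCm, OpenVINO, and SYCL variants), Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android (arm64 CPU), and openEuler. It strengthens llama.cpp's position as a widely used tool for running MoE models offline on consumer devices.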
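The per-token expert selection described above can be sketched in a few lines of plain Python. This is a toy illustration of top-k routing, not llama.cpp's OpenCL kernels: the function names `route_top_k` and `moe_forward`, the scaling "experts", and the logit values are all hypothetical, and a real layer would use learned router weights and feed-forward experts.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and normalize their weights.

    Returns a list of (expert_index, weight) pairs, with weights summing to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and blend their outputs."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)  # the other experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


if __name__ == "__main__":
    # Hypothetical experts: expert i simply scales its input by (i + 1).
    experts = [lambda v, s=i + 1: [s * a for a in v] for i in range(8)]
    logits = [0.1, 2.0, -1.0, 3.0, 0.5, 0.0, 0.3, 1.0]  # made-up router scores
    print(route_top_k(logits, k=2))
    print(moe_forward([1.0, 2.0], experts, logits, k=2))
```

With eight experts and k=2, only a quarter of the expert weights take part in each token's computation, which is where the compute saving comes from. All eight still have to be resident in memory because the next token may route elsewhere.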

Key Points

Adds MoE support for q4_k, q5_k, q6_k quantizations on Adreno GPUs via OpenCL
Contributed by Qualcomm engineer Li He, enabling mobile-friendly local inference
llama.cpp now supports MoE on Android devices running Adreno GPUs, expanding edge AI capabilities

Why It Matters

MoE models on mobile hardware mean powerful AI assistants can run fully offline.
