b9070
New OpenCL kernel accelerates MoE models on Qualcomm mobile GPUs
The latest llama.cpp release (b9070) adds an OpenCL kernel for Q4_0 Mixture-of-Experts (MoE) GEMM (general matrix multiplication), optimized for Qualcomm Adreno GPUs. This targets the increasingly popular MoE architecture, in which only a subset of model parameters is activated per token, combined with 4-bit quantization (Q4_0) to reduce memory footprint and compute. The new GEMM kernel improves inference speed on Adreno chipsets (common in Android phones), enabling efficient execution of quantized MoE models such as Mixtral or DeepSeek on mobile hardware. The update also includes a sanity check for the CLC pass, whitespace fixes, and removal of unused code.
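To make the data path concrete, here is a minimal CPU-side sketch in plain C (not the new OpenCL kernel itself) of what a Q4_0 MoE matrix-vector step computes: dequantize 4-bit blocks on the fly and accumulate dot products only for the experts the router selected. The block layout follows GGML's q4_0 format (32 weights per block sharing one scale; GGML stores the scale as fp16, simplified to a float here). The function names and the per-row layout are illustrative assumptions, not the actual structure of the kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define QK4_0 32  /* weights per Q4_0 block, as in GGML */

/* Simplified Q4_0 block: GGML stores the scale as fp16; a plain
 * float is used here so the sketch stays self-contained. */
typedef struct {
    float   d;              /* per-block scale */
    uint8_t qs[QK4_0 / 2];  /* 32 4-bit quants, two per byte */
} block_q4_0;

/* Dequantize one block: each 4-bit value is offset by 8, giving a
 * range of [-8, 7] * d. Low nibbles hold elements 0..15 and high
 * nibbles hold elements 16..31, matching GGML's layout. */
static void dequantize_q4_0(const block_q4_0 *b, float *out) {
    for (int j = 0; j < QK4_0 / 2; ++j) {
        out[j]             = (float)((b->qs[j] & 0x0F) - 8) * b->d;
        out[j + QK4_0 / 2] = (float)((b->qs[j] >> 4)   - 8) * b->d;
    }
}

/* One MoE matrix-vector step for a single token: multiply the
 * activations x only by the weights of the experts the router
 * selected. Each expert's matrix is n_rows x n_cols, stored row by
 * row as n_cols/QK4_0 consecutive Q4_0 blocks. */
static void moe_matvec_q4_0(const block_q4_0 *const *expert_w, /* [n_sel] */
                            int n_sel, int n_rows, int n_cols,
                            const float *x, float *y /* [n_sel * n_rows] */) {
    float tmp[QK4_0];
    const int blocks_per_row = n_cols / QK4_0;
    for (int e = 0; e < n_sel; ++e) {
        for (int r = 0; r < n_rows; ++r) {
            const block_q4_0 *row = expert_w[e] + (size_t)r * blocks_per_row;
            float acc = 0.0f;
            for (int b = 0; b < blocks_per_row; ++b) {
                dequantize_q4_0(&row[b], tmp);
                for (int j = 0; j < QK4_0; ++j)
                    acc += tmp[j] * x[b * QK4_0 + j];
            }
            y[e * n_rows + r] = acc;
        }
    }
}
```

The point of an optimized GPU kernel is to fuse these steps, expanding the 4-bit weights in registers rather than materializing float copies in memory.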
Beyond the Adreno optimization, the release packages binaries for a wide range of platforms: macOS (Apple Silicon, Intel, KleidiAI), Linux (x64, ARM, Vulkan, ROCm, OpenVINO, SYCL), Windows (CPU, CUDA 12/13, Vulkan, HIP), Android (ARM64 CPU), and openEuler (x86 and aarch64). This build matrix lets developers deploy llama.cpp across diverse environments. The focus on mobile GPU acceleration reflects a broader trend in AI inference: bringing large language models to edge devices without cloud dependency. For developers building on-device AI applications, this update provides a tangible performance boost for MoE models on Qualcomm-powered hardware.
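For application code, GPU offload is requested the same way regardless of backend. Below is a minimal sketch using the llama.cpp C API (function names follow recent llama.h revisions and may differ across versions; the model path is a placeholder). With an OpenCL-enabled build running on an Adreno device, layers offloaded this way would exercise the new Q4_0 MoE kernel.

```c
#include <stdio.h>
#include "llama.h"

int main(int argc, char **argv) {
    /* placeholder path to a Q4_0-quantized MoE model in GGUF format */
    const char *path = argc > 1 ? argv[1] : "model-q4_0.gguf";

    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;  /* offload all layers to the GPU backend */

    struct llama_model *model = llama_model_load_from_file(path, mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load %s\n", path);
        return 1;
    }

    /* ... create a context and run inference as usual ... */

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```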
- New OpenCL kernel for q4_0 MoE GEMM optimized for Qualcomm Adreno GPUs
- Enables faster inference of Mixture-of-Experts models with 4-bit quantization on mobile devices
- Supports 15+ platform builds including macOS, Linux, Windows, Android, and openEuler
Why It Matters
Brings efficient LLM inference to smartphones via Adreno GPUs, enabling on-device AI