b9070
New OpenCL kernel accelerates MoE models on Qualcomm mobile GPUs
The latest llama.cpp release (b9070) adds an OpenCL kernel for Q4_0 Mixture-of-Experts (MoE) GEMM (general matrix multiplication), optimized for Qualcomm Adreno GPUs. This targets the increasingly popular MoE architecture, in which only a subset of model parameters is activated per token, combined with 4-bit quantization (Q4_0) to reduce memory footprint and compute. The new GEMM kernel improves inference speed on Adreno chipsets (common in Android phones), enabling efficient execution of quantized MoE models such as Mixtral or DeepSeek on mobile hardware. The update also includes a sanity check for the CLC pass, whitespace fixes, and removal of unused code.
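To make the data path concrete, here is a minimal CPU-side sketch in plain C (not the new OpenCL kernel itself) of what a Q4_0 MoE matrix-vector step computes: dequantize 4-bit blocks on the fly and accumulate dot products only for the experts the router selected. The block layout follows GGML's q4_0 format (32 weights per block sharing one scale; GGML stores the scale as fp16, simplified to a float here). The function names and the per-row layout are illustrative assumptions, not the actual structure of the kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define QK4_0 32  /* weights per Q4_0 block, as in GGML */

/* Simplified Q4_0 block: GGML stores the scale as fp16; a plain
 * float is used here so the sketch stays self-contained. */
typedef struct {
    float   d;              /* per-block scale */
    uint8_t qs[QK4_0 / 2];  /* 32 4-bit quants, two per byte */
} block_q4_0;

/* Dequantize one block: each 4-bit value is offset by 8, giving a
 * range of [-8, 7] * d. Low nibbles hold elements 0..15 and high
 * nibbles hold elements 16..31, matching GGML's layout. */
static void dequantize_q4_0(const block_q4_0 *b, float *out) {
    for (int j = 0; j < QK4_0 / 2; ++j) {
        out[j]             = (float)((b->qs[j] & 0x0F) - 8) * b->d;
        out[j + QK4_0 / 2] = (float)((b->qs[j] >> 4)   - 8) * b->d;
    }
}

/* One MoE matrix-vector step for a single token: multiply the
 * activations x only by the weights of the experts the router
 * selected. Each expert's matrix is n_rows x n_cols, stored row by
 * row as n_cols/QK4_0 consecutive Q4_0 blocks. */
static void moe_matvec_q4_0(const block_q4_0 *const *expert_w, /* [n_sel] */
                            int n_sel, int n_rows, int n_cols,
                            const float *x, float *y /* [n_sel * n_rows] */) {
    float tmp[QK4_0];
    const int blocks_per_row = n_cols / QK4_0;
    for (int e = 0; e < n_sel; ++e) {
        for (int r = 0; r < n_rows; ++r) {
            const block_q4_0 *row = expert_w[e] + (size_t)r * blocks_per_row;
            float acc = 0.0f;
            for (int b = 0; b < blocks_per_row; ++b) {
                dequantize_q4_0(&row[b], tmp);
                for (int j = 0; j < QK4_0; ++j)
                    acc += tmp[j] * x[b * QK4_0 + j];
            }
            y[e * n_rows + r] = acc;
        }
    }
}
```

The point of an optimized GPU kernel is to fuse these steps, expanding the 4-bit weights in registers rather than materializing float copies in memory.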
Beyond the Adreno optimization, the release packages binaries for a wide range of platforms: macOS (Apple Silicon, Intel, KleidiAI), Linux (x64, ARM, Vulkan, ROCm, OpenVINO, SYCL), Windows (CPU, CUDA 12/13, Vulkan, HIP), Android (ARM64 CPU), and openEuler (x86 and aarch64). This build matrix lets developers deploy llama.cpp across diverse environments. The focus on mobile GPU acceleration reflects a broader trend in AI inference: bringing large language models to edge devices without cloud dependency. For developers building on-device AI applications, this update provides a tangible performance boost for MoE models on Qualcomm-powered hardware.
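For application code, GPU offload is requested the same way regardless of backend. Below is a minimal sketch using the llama.cpp C API (function names follow recent llama.h revisions and may differ across versions; the model path is a placeholder). With an OpenCL-enabled build running on an Adreno device, layers offloaded this way would exercise the new Q4_0 MoE kernel.

```c
#include <stdio.h>
#include "llama.h"

int main(int argc, char **argv) {
    /* placeholder path to a Q4_0-quantized MoE model in GGUF format */
    const char *path = argc > 1 ? argv[1] : "model-q4_0.gguf";

    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;  /* offload all layers to the GPU backend */

    struct llama_model *model = llama_model_load_from_file(path, mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load %s\n", path);
        return 1;
    }

    /* ... create a context and run inference as usual ... */

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```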
- New OpenCL kernel for q4_0 MoE GEMM optimized for Qualcomm Adreno GPUs
- Enables faster inference of Mixture-of-Experts models with 4-bit quantization on mobile devices
- Supports 15+ platform builds including macOS, Linux, Windows, Android, and openEuler
Why It Matters
Brings efficient LLM inference to smartphones via Adreno GPUs, enabling on-device AI