llama.cpp b9113 adds Q4_1 MoE support for Adreno GPUs
Run Mixture-of-Experts models faster on Qualcomm Adreno GPUs with new quantization…
Deep Dive
llama.cpp release b9113 (from ggml-org) adds OpenCL support for Q4_1-quantized Mixture-of-Experts (MoE) layers on Qualcomm Adreno GPUs. This lets large MoE models such as Mixtral run more efficiently on mobile and edge devices. The update also adds sanity checks, removes unnecessary code, and includes shape-specific optimizations for Adreno. Builds are available for macOS, Linux, Windows, Android, iOS, and more.
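For readers unfamiliar with Q4_1, the sketch below illustrates the block format as ggml defines it: 32 weights share an fp16 scale `d` and an fp16 minimum `m`, and each weight is stored as a 4-bit index, so a block takes 20 bytes (5 bits per weight). This is a plain-Python illustration, not the OpenCL kernel from the release. The helper names `quantize_q4_1` and `dequantize_q4_1` are hypothetical, and the rounding in the quantizer is a simplification that may differ from ggml's reference code.

```python
import struct

QK4_1 = 32  # number of weights per Q4_1 block in ggml


def _to_f16(x):
    """Round a Python float to fp16 precision, as the stored scale and min would be."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_q4_1(values):
    """Quantize one block of 32 floats (hypothetical helper, simplified rounding)."""
    assert len(values) == QK4_1
    lo, hi = min(values), max(values)
    d = _to_f16((hi - lo) / 15.0)  # scale: 16 levels span the block's range
    m = _to_f16(lo)                # minimum: offset added back on dequantization
    inv = 1.0 / d if d != 0.0 else 0.0
    # Map each value to a 4-bit index, clamped to 0..15.
    q = [min(15, max(0, int((v - m) * inv + 0.5))) for v in values]
    half = QK4_1 // 2
    # Byte j packs element j in the low nibble and element j + 16 in the high nibble.
    qs = bytes(q[j] | (q[j + half] << 4) for j in range(half))
    return d, m, qs


def dequantize_q4_1(d, m, qs):
    """Reconstruct 32 floats from one block: x = d * q + m."""
    half = QK4_1 // 2
    out = [0.0] * QK4_1
    for j in range(half):
        out[j] = (qs[j] & 0x0F) * d + m
        out[j + half] = (qs[j] >> 4) * d + m
    return out
```

Compared with Q4_0, which stores only a scale, the extra per-block minimum lets Q4_1 represent blocks whose values are not centered on zero, at the cost of two more bytes per block.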
Key Points
- Adds OpenCL support for Q4_1-quantized MoE layers on Adreno GPUs
- Includes shape-specific optimizations and a CLC pass for MoE inference
- Builds available for 20+ platform variants including mobile and desktop
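As background on what an MoE layer computes, here is a minimal sketch of top-k expert routing in plain Python. It is a generic illustration rather than llama.cpp's implementation: the function name `moe_forward`, its parameters, and the choice to apply softmax over only the selected experts are assumptions, and real models differ in how they normalize gate weights.

```python
import math


def moe_forward(x, router_weights, experts, top_k=2):
    """Route one token through its top_k experts (hypothetical helper).

    x: list of floats (one token's hidden state)
    router_weights: one weight row per expert, each the same length as x
    experts: list of callables mapping a vector to a vector of the same length
    """
    # Router logits: one dot product per expert.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Keep the indices of the top_k experts with the largest logits.
    chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    # Softmax over the selected logits only (one common convention, assumed here).
    peak = max(logits[e] for e in chosen)
    exps = [math.exp(logits[e] - peak) for e in chosen]
    total = sum(exps)
    gates = [v / total for v in exps]
    # Weighted sum of the selected experts' outputs; other experts are never evaluated.
    out = [0.0] * len(x)
    for gate, e in zip(gates, chosen):
        y = experts[e](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen
```

Because only `top_k` experts run per token, only those experts' weights need to be read, which is why compact formats such as Q4_1 matter on memory-constrained mobile GPUs.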
Why It Matters
Enables efficient MoE model inference on mobile GPUs, broadening real-world deployment of large language models.