Llama.cpp PR #23449 generalizes Adreno MoE kernels across 20+ platforms
Mixture-of-Experts inference on Qualcomm Adreno GPUs is no longer limited to a narrow set of configurations.
llama.cpp's pull request #23449 generalizes the Mixture-of-Experts (MoE) kernels in its OpenCL backend for Qualcomm Adreno GPUs, which were previously limited in scope. The platform list published with it is extensive: macOS (Apple Silicon and Intel), iOS, Linux (x64, arm64, and s390x, with Vulkan, ROCm, OpenVINO, and SYCL variants), Android (arm64 CPU), Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP), and openEuler (x86 and aarch64, with ACL Graph support). Most of those entries appear to be build configurations of llama.cpp as a whole: CUDA targets NVIDIA GPUs, ROCm and HIP target AMD GPUs, and the Macs listed do not contain Adreno hardware. The Adreno MoE kernels themselves apply only where an Adreno GPU and an OpenCL driver are present, which in practice means Snapdragon-based devices such as Android phones and Windows laptops.
The generalization matters for large language models built on MoE architectures (where only a subset of parameters is activated per token), because it lets them run on widely available Adreno GPUs. The kernels are written against OpenCL, an open standard, so the same backend code can serve Android, Windows on Snapdragon, and custom Linux setups, provided the device ships an OpenCL driver for the GPU. For MoE models, that should mean faster inference on Qualcomm-powered hardware than falling back to the CPU. The platform list also mentions KleidiAI in connection with Apple Silicon; KleidiAI is Arm's library of optimized CPU kernels, so that is a CPU-side optimization separate from the Adreno GPU work. As MoE models become more popular in edge AI, this update lowers the barrier to deploying them on Snapdragon devices.
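To make "only a subset of parameters is activated per token" concrete, here is a minimal sketch of top-k expert routing in plain Python. It is an illustration of the general MoE idea, not the code from the pull request or the OpenCL kernels; the names `route_token`, `moe_forward`, and `make_scale_expert` are hypothetical, and the experts are toy functions rather than real feed-forward layers.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    out = [0.0] * len(x)
    for idx, weight in route_token(router_logits, k):
        y = experts[idx](x)  # experts that were not selected are never called
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


def make_scale_expert(scale):
    """Toy expert (assumption for illustration): multiplies the input by a constant."""
    return lambda x: [scale * v for v in x]


if __name__ == "__main__":
    experts = [make_scale_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
    logits = [0.1, 2.0, -1.0, 1.5]  # hypothetical router scores for one token
    print(route_token(logits, k=2))
    print(moe_forward([1.0, -1.0], experts, logits, k=2))
```

With four experts and `k=2`, half of the expert weights are never touched for this token. That sparsity is what makes MoE models attractive on memory- and power-constrained GPUs, and it is also why they need dedicated kernels: the work per token depends on which experts the router selects.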
- Generalizes the OpenCL MoE kernels for Qualcomm Adreno GPUs in llama.cpp.
- Ships alongside a list of 20+ platform configurations: macOS (Apple Silicon & Intel), iOS, Linux (x64, arm64, s390x), Android, Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP), and openEuler.
- Enables efficient Mixture-of-Experts inference on Adreno-equipped hardware such as Android phones and Snapdragon-based Windows laptops.
Why It Matters
MoE model deployment on Qualcomm Adreno GPUs becomes less tied to a single platform, which should make edge inference faster across Snapdragon-based devices.