Developer Tools

llama.cpp b9006

New release brings 4-bit MoE inference to Qualcomm GPUs for faster edge AI.

Deep Dive

The open-source llama.cpp project has released version b9006, featuring a notable optimization for Qualcomm Adreno GPUs. The update introduces a specialized OpenCL kernel for Mixture-of-Experts (MoE) models quantized with MXFP4, the 4-bit microscaling floating-point format. This allows MoE inference to run on Adreno hardware with improved efficiency, including a router reorder step performed directly on the GPU rather than on the host. The commit also includes assorted code cleanups, removal of unneeded headers and asserts, and a precision fix. The work is credited to a collaboration with Li He of Qualcomm.
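The release notes don't show the kernel itself, but the router-reorder idea can be illustrated in plain C++ (an illustrative sketch only, not llama.cpp's actual code; the function and struct names here are invented for the example): the MoE router picks the top-k experts per token, then the (token, expert) pairs are sorted by expert id so each expert processes a contiguous batch, and doing that sort on the GPU avoids a host round-trip.

```cpp
#include <algorithm>
#include <vector>

// One routed (token, expert) pair with its router score.
struct Assignment { int token; int expert; float weight; };

// Pick the top_k highest-scoring experts for each token, then group the
// resulting assignments by expert id (the "reorder" step).
std::vector<Assignment> route_and_reorder(
        const std::vector<std::vector<float>> &logits, int top_k) {
    std::vector<Assignment> out;
    for (int t = 0; t < (int) logits.size(); ++t) {
        // Expert indices sorted by descending score for this token.
        std::vector<int> idx(logits[t].size());
        for (int e = 0; e < (int) idx.size(); ++e) idx[e] = e;
        std::partial_sort(idx.begin(), idx.begin() + top_k, idx.end(),
            [&](int a, int b) { return logits[t][a] > logits[t][b]; });
        for (int k = 0; k < top_k; ++k)
            out.push_back({t, idx[k], logits[t][idx[k]]});
    }
    // Stable-sort by expert so each expert's tokens are contiguous,
    // letting each expert's weights be loaded once per batch.
    std::stable_sort(out.begin(), out.end(),
        [](const Assignment &a, const Assignment &b) {
            return a.expert < b.expert;
        });
    return out;
}
```

Grouping tokens by expert matters on a GPU because it turns many scattered per-token expert lookups into a few dense matrix multiplications, one per active expert.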

This release underscores the growing push to bring large language model inference to edge devices. By targeting Adreno GPUs, commonly found in Qualcomm Snapdragon mobile chips, llama.cpp enables faster and more memory-efficient execution of MoE architectures, which activate only a few experts per token but carry large total parameter counts that strain mobile memory. The update is part of a broader effort to support diverse hardware backends across platforms (macOS on Apple Silicon, Linux, Windows, Android, and more). While the release notes provide no specific speed benchmarks, low-bit formats such as MXFP4 substantially reduce memory footprint, bandwidth, and power consumption, making sophisticated AI models viable on phones, tablets, and other portable hardware.

Key Points
  • Adreno-specific MoE MXFP4 OpenCL kernel added to leverage Qualcomm GPU hardware
  • Router reordering now performed on GPU for lower latency
  • Multi-platform builds cover macOS, Linux, Windows, Android, and openEuler

Why It Matters

Edge AI gets a boost: larger MoE models now run efficiently on Qualcomm-powered devices.