Developer Tools

llama.cpp b9142 adds q5_0 and q5_1 MoE support for Adreno GPUs

New quantization support brings 5-bit MoE models to Qualcomm's mobile GPUs for faster, more efficient local LLM inference.

Deep Dive

llama.cpp, the popular open-source library for running large language models locally, has released build b9142 with improvements to its GPU support. The release adds the q5_0 and q5_1 quantization types for Mixture-of-Experts (MoE) models on the Adreno GPUs found in many Qualcomm-powered mobile devices. MoE architectures route each token through a small subset of specialized expert sub-networks rather than the full model. With this update they can run on Adreno with 5-bit quantization, which balances model quality against memory footprint.
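To make the memory trade-off concrete, the sketch below estimates storage cost from the ggml block layouts for q5_0 and q5_1: each block holds 32 weights as packed 5-bit values plus an fp16 scale, and q5_1 adds an fp16 minimum. The helper names and the parameter count are hypothetical and chosen for illustration only. The estimate ignores tensors kept at other precisions, the KV cache, and runtime buffers, so treat it as rough arithmetic rather than a measurement.

```python
from dataclasses import dataclass

BLOCK_WEIGHTS = 32  # weights per quantization block in the q5_0 / q5_1 layouts


@dataclass(frozen=True)
class BlockLayout:
    """Per-block storage for a 5-bit format (illustrative helper, not a llama.cpp API)."""
    name: str
    scale_bytes: int   # fp16 scale stored once per block
    offset_bytes: int  # fp16 minimum stored once per block (q5_1 only)

    @property
    def bytes_per_block(self) -> int:
        low_bits = BLOCK_WEIGHTS // 2   # 4 low bits per weight, two weights per byte
        high_bits = BLOCK_WEIGHTS // 8  # 1 high bit per weight, packed into 4 bytes
        return self.scale_bytes + self.offset_bytes + high_bits + low_bits

    @property
    def bits_per_weight(self) -> float:
        return self.bytes_per_block * 8 / BLOCK_WEIGHTS


Q5_0 = BlockLayout("q5_0", scale_bytes=2, offset_bytes=0)
Q5_1 = BlockLayout("q5_1", scale_bytes=2, offset_bytes=2)


def estimate_gib(num_weights: int, layout: BlockLayout) -> float:
    """Approximate size in GiB if every weight used the given layout."""
    blocks = num_weights / BLOCK_WEIGHTS
    return blocks * layout.bytes_per_block / 2**30


if __name__ == "__main__":
    hypothetical_weights = 10_000_000_000  # assumed model size, not from the release
    fp16_gib = hypothetical_weights * 2 / 2**30
    print(f"fp16: 16.0 bits/weight, ~{fp16_gib:.2f} GiB")
    for layout in (Q5_0, Q5_1):
        size = estimate_gib(hypothetical_weights, layout)
        print(f"{layout.name}: {layout.bits_per_weight} bits/weight, ~{size:.2f} GiB")
```

The per-block scale and minimum are why the effective cost is 5.5 bits per weight for q5_0 and 6.0 for q5_1 rather than exactly 5. q5_1 spends the extra half bit on an offset that can represent asymmetric weight distributions more closely.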

Beyond the core feature, the release includes stability and build fixes: potential GPU memory leaks are addressed, and unused-variable warnings are suppressed when building for non-Adreno targets. Pre-built binaries are available for all major platforms, including macOS (Apple Silicon and Intel), Linux (x64/arm64 with or without Vulkan/ROCm), Windows (CPU/CUDA/Vulkan), Android (arm64), and more. For developers working with edge AI or mobile LLM inference, this means MoE models can run at 5-bit precision with GPU acceleration on Adreno-equipped devices, which should yield higher-quality output than lower-bit quantization.

Key Points
  • Adds q5_0 and q5_1 quantization for Mixture-of-Experts (MoE) models on Adreno GPUs.
  • Fixes potential GPU memory leaks and suppresses unused-variable warnings in non-Adreno builds.
  • Available for all major platforms (macOS, Linux, Windows, Android), with builds for multiple GPU backends (Vulkan, CUDA, ROCm).

Why It Matters

Mobile and edge LLM inference gets a quality boost with 5-bit MoE support on Qualcomm hardware.