Developer Tools

llama.cpp b9140 fixes MoE crash on Adreno GPUs

A critical bug fix for Qualcomm Adreno users running MoE models.

Deep Dive

llama.cpp, the popular open-source inference engine for large language models, has shipped release b9140 with a targeted fix for Qualcomm Adreno GPUs. The release addresses a crash that occurred during the warmup phase of Mixture of Experts (MoE) models, an architecture used by high-performance LLMs such as Mixtral. The commit carries a GitHub-verified signature, so users deploying it in production can confirm where it came from and that it was not altered after signing.

Alongside the Adreno fix, b9140 comes with a broad set of pre-built binaries. They cover macOS and iOS (Apple Silicon arm64, Apple Silicon with KleidiAI optimizations, Intel x64, and an iOS XCFramework); Linux (CPU builds for x64, arm64, and s390x, plus Vulkan, ROCm 7.2, OpenVINO, and SYCL FP32/FP16); Windows (CPU builds for x64 and arm64, plus CUDA 12/13, Vulkan, SYCL, and HIP); Android (arm64 CPU); and openEuler (x86 and aarch64 with ACL Graph support). This coverage lets developers and researchers run LLMs on diverse hardware without compiling from source.
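
For readers scripting a download step, the platform list above can be captured as a small lookup table. The sketch below is illustrative only: the strings are transcribed from this summary and are not the actual release asset file names, and the names PREBUILT_TARGETS and targets_for are hypothetical helpers, not part of llama.cpp.

```python
import platform

# Targets transcribed from the release summary above.
# These are descriptive labels, NOT real asset file names (assumption).
PREBUILT_TARGETS = {
    "macos": ["Apple Silicon arm64", "Apple Silicon arm64 + KleidiAI",
              "Intel x64", "iOS XCFramework"],
    "linux": ["CPU (x64, arm64, s390x)", "Vulkan", "ROCm 7.2",
              "OpenVINO", "SYCL FP32", "SYCL FP16"],
    "windows": ["CPU (x64, arm64)", "CUDA 12", "CUDA 13",
                "Vulkan", "SYCL", "HIP"],
    "android": ["CPU (arm64)"],
    "openeuler": ["x86 and aarch64 with ACL Graph support"],
}

# platform.system() reports "Darwin" on macOS; map it to the key used above.
_OS_ALIASES = {"darwin": "macos"}


def targets_for(os_name=None):
    """Return the listed pre-built targets for an OS name.

    Falls back to the current machine's OS when no name is given,
    and returns an empty list for platforms the summary does not mention.
    """
    name = (os_name or platform.system()).lower()
    key = _OS_ALIASES.get(name, name)
    return PREBUILT_TARGETS.get(key, [])


if __name__ == "__main__":
    print(targets_for())
```

An empty result means the platform is not in the list above, in which case building from source remains the fallback.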

Key Points
  • Fixes crash when warming up Mixture of Experts (MoE) models on Qualcomm Adreno GPUs.
  • Signed commit (GitHub verified GPG key B5690EEEBB952194) lets users confirm the commit's origin and that it was not altered after signing.
  • Now offers 30+ pre-built binaries across macOS, Windows, Linux, Android, and openEuler with CUDA, Vulkan, ROCm, SYCL, and HIP backends.

Why It Matters

Stable MoE inference on mobile/edge Adreno GPUs expands on-device LLM deployment possibilities.