Developer Tools

b9037

Qualcomm Hexagon users get faster local AI inference by moving computations to HMX.

Deep Dive

The latest llama.cpp release, b9037, delivers significant performance gains on Qualcomm's Hexagon DSP architecture. The core change moves m-tail row processing (handling the leftover rows when a matrix's row count is not a multiple of the hardware tile size) from HVX (Hexagon Vector eXtensions) to HMX (Hexagon Matrix eXtensions), which is better suited to matrix-heavy LLM workloads. This offloading reduces latency and improves throughput for models running on Snapdragon-powered edge devices. In addition, the hmx-mm inner loop has been unrolled and now operates on padded activations, further improving computational efficiency.

The release also extends platform support, offering pre-built binaries for macOS (Apple Silicon and Intel), Linux (x64, arm64, s390x, Vulkan, ROCm, OneAPI), Windows (CPU, CUDA, Vulkan, SYCL, HIP), Android (arm64), and openEuler. This ensures developers can deploy optimized local inference across diverse hardware, including mobile, desktop, and server environments. For professionals running private LLMs on Qualcomm hardware, this update delivers measurable speedups without quality loss.
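For platforms without a matching pre-built binary, the standard CMake source build works at this tag. A minimal sketch (the model path is a placeholder; release pages also ship ready-made binaries for the platforms listed above):

```shell
# Build llama.cpp from source at release tag b9037
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout b9037
cmake -B build
cmake --build build --config Release -j

# Run local inference with a GGUF model (path is a placeholder)
./build/bin/llama-cli -m /path/to/model.gguf -p "Hello"
```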

Key Points
  • M-tail rows processed on HMX instead of HVX, improving Hexagon DSP efficiency for LLMs
  • Optimized padded activation loop with loop unrolling for faster matrix operations
  • Expanded build support including Android arm64, Windows arm64, and openEuler aarch64

Why It Matters

Faster on-device LLM inference on Qualcomm devices enables more responsive AI applications at the edge.