b9037
Qualcomm Hexagon users get faster local AI inference as matrix computations move from HVX to HMX.
The latest release of llama.cpp, version b9037, targets significant performance gains on Qualcomm's Hexagon DSP architecture. The core change moves m-tail row processing from HVX (Hexagon Vector eXtensions) to HMX (Hexagon Matrix eXtensions), which is better suited to matrix-heavy LLM workloads. The offload reduces latency and improves throughput for models running on Snapdragon-powered edge devices. Additionally, the hmx-mm inner loop has been unrolled and now operates on padded activations, further improving matrix-multiply efficiency.
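To make the idea concrete, here is a minimal, portable sketch of the technique, not the actual ggml Hexagon kernels or HMX intrinsics: a hypothetical `tile_matmul` stands in for a matrix-engine multiply of `TILE_M` rows at a time, its inner loop is unrolled by four, and the leftover "tail" rows are zero-padded up to a full tile so they stay on the matrix path instead of falling back to a vector-unit routine. All names and the tile size are illustrative assumptions.

```cpp
// Conceptual sketch only -- not llama.cpp/ggml code and not real HMX intrinsics.
#include <cstring>
#include <vector>

constexpr int TILE_M = 32; // rows the (hypothetical) matrix engine consumes per call

// Stand-in for a matrix-engine tile multiply: C[TILE_M x n] += A[TILE_M x k] * B[k x n]
static void tile_matmul(const float *A, const float *B, float *C, int k, int n) {
    for (int i = 0; i < TILE_M; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            int p = 0;
            // Inner loop unrolled by 4, mirroring the unrolled hmx-mm inner loop.
            for (; p + 4 <= k; p += 4) {
                acc += A[i*k + p    ] * B[(p    )*n + j];
                acc += A[i*k + p + 1] * B[(p + 1)*n + j];
                acc += A[i*k + p + 2] * B[(p + 2)*n + j];
                acc += A[i*k + p + 3] * B[(p + 3)*n + j];
            }
            for (; p < k; ++p) {
                acc += A[i*k + p] * B[p*n + j];
            }
            C[i*n + j] += acc;
        }
    }
}

// Multiply an m x k activation matrix A by a k x n weight matrix B into C.
// Tail rows (m % TILE_M) are zero-padded so they run through the same
// matrix path as the full tiles, instead of a separate vector fallback.
void matmul_padded(const float *A, const float *B, float *C, int m, int k, int n) {
    const int full = (m / TILE_M) * TILE_M;
    for (int r = 0; r < full; r += TILE_M) {
        tile_matmul(A + r*k, B, C + r*n, k, n);
    }

    const int tail = m - full;
    if (tail > 0) {
        // Pad the remaining activation rows with zeros up to a full tile.
        std::vector<float> a_pad(TILE_M * k, 0.0f);
        std::vector<float> c_pad(TILE_M * n, 0.0f);
        std::memcpy(a_pad.data(), A + full*k, sizeof(float) * tail * k);
        std::memcpy(c_pad.data(), C + full*n, sizeof(float) * tail * n);
        tile_matmul(a_pad.data(), B, c_pad.data(), k, n);
        // Copy back only the real rows; the zero rows produce zeros and are discarded.
        std::memcpy(C + full*n, c_pad.data(), sizeof(float) * tail * n);
    }
}
```

The padding trades a small copy for keeping every row on the wide matrix engine; since the padded rows are all zeros, their results contribute nothing and are simply dropped.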
The release also extends platform support, offering pre-built binaries for macOS (Apple Silicon and Intel), Linux (x64, arm64, s390x, Vulkan, ROCm, OneAPI), Windows (CPU, CUDA, Vulkan, SYCL, HIP), Android (arm64), and openEuler, letting developers deploy optimized local inference across mobile, desktop, and server environments. For professionals running private LLMs on Qualcomm hardware, this update delivers measurable speedups without quality loss.
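For readers picking up one of the pre-built binaries, a quick smoke test against the C API looks roughly like the sketch below. The function names match recent llama.h headers (older builds spell them `llama_load_model_from_file` and `llama_free_model`), so treat this as an assumption and check the header shipped with your build.

```cpp
// Minimal smoke test: load a GGUF model and print basic metadata.
#include <cstdio>
#include "llama.h"

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model *model = llama_model_load_from_file(argv[1], mparams);
    if (!model) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        llama_backend_free();
        return 1;
    }

    printf("loaded %s (%lld parameters)\n", argv[1],
           (long long) llama_model_n_params(model));

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```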
- M-tail rows processed on HMX instead of HVX, improving Hexagon DSP efficiency for LLMs
- Unrolled hmx-mm inner loop operating on padded activations for faster matrix operations
- Expanded build support including Android arm64, Windows arm64, and openEuler aarch64
Why It Matters
Faster on-device LLM inference on Qualcomm hardware enables more responsive AI applications at the edge.