llama.cpp b9084
New HVX-optimized kernels slash vector reload overhead by 50% on Qualcomm HTP.
Release b9084 of llama.cpp, the popular open-source LLM inference framework, introduces a major optimization for Qualcomm Hexagon processors. The update adds HTP (Hexagon Tensor Processor) kernels for the Gated Delta Net recurrence operation. Key architectural improvements include 4-row fused kernels for the prompt processing (PP) path and 8-row fused kernels for the token generation (TG) path. This fusion halves K/Q/gate vector reload overhead, delivering corresponding throughput gains on edge devices. The release also separates the PP and TG thread functions to improve I-cache isolation, and implements a VTCM state scratchpad with DMA in/out for single-cycle access during token generation. The gate exponential is vectorized via the hvx_exp_f32 routine for efficient computation.
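The benefit of row fusion can be illustrated with a simple scalar sketch (this is not the llama.cpp HVX kernel; all names, shapes, and the toy recurrence are illustrative). Each k/q/gate element is loaded once into a register and reused across all fused rows, instead of being reloaded per row:

```c
#include <stdio.h>

#define D    8   /* state dimension (illustrative) */
#define ROWS 8   /* rows fused per pass, as in the TG kernel */

/* Hypothetical scalar sketch of row fusion: the k, q, and gate values
 * are loaded once per column and reused across ROWS state rows,
 * halving (or better) the reload traffic of a row-at-a-time loop. */
static void fused_rows(const float *k, const float *q, const float *g,
                       float *state, float *out)
{
    for (int j = 0; j < D; j++) {
        float kj = k[j], qj = q[j], gj = g[j];  /* single load per column */
        for (int r = 0; r < ROWS; r++) {
            /* every fused row reuses kj/qj/gj from registers */
            state[r * D + j] = state[r * D + j] * gj + kj;
            out[r] += state[r * D + j] * qj;
        }
    }
}
```

In the real HVX kernels the same idea applies at vector granularity: one vector load of k/q/gate feeds four (PP) or eight (TG) rows of state updates.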
The performance of these kernels is critical for running modern LLMs on Qualcomm's Hexagon DSP, commonly found in mobile and IoT devices. By reducing memory bandwidth pressure and optimizing compute patterns, llama.cpp b9084 enables faster token generation and lower latency on edge hardware. The release includes builds for multiple platforms including macOS (Apple Silicon, Intel), Linux (x64, arm64, s390x), Windows (CPU, CUDA, Vulkan, HIP, SYCL), iOS XCFramework, and Android arm64. This broad compatibility ensures developers can deploy the Gated Delta Net optimizations across a wide range of production environments, from cloud servers to edge devices.
- 4-row fused kernels for prompt processing and 8-row fused kernels for token generation reduce vector reload overhead by 2x
- Separate PP/TG thread functions improve I-cache isolation; VTCM scratchpad with DMA enables single-cycle token generation access
- Vectorized gate exponential via hvx_exp_f32 speeds up the Gated Delta Net recurrence on Qualcomm Hexagon HTP
Why It Matters
Unlocks efficient LLM inference on Qualcomm edge devices, enabling faster AI assistants and local processing.