llama.cpp b9222 adds Hexagon TRI op for faster AI inference
Qualcomm Hexagon HTP gets a new TRI operator for on-device LLMs...
The latest release of llama.cpp (b9222) adds support for the TRI operation on Qualcomm Hexagon HTP (Hexagon Tensor Processor) cores. In ggml, TRI is the triangular op: it keeps the upper or lower triangular part of a matrix, with or without the diagonal, and zeroes everything else. The change was contributed by Todor Boinovski and Max Krasnyansky from Qualcomm and targets Hexagon-based hardware, which is commonly found in smartphones and edge devices. Triangular masking shows up in some of the tensor operations used by large language models, so native support lets those graphs make better use of Hexagon's vector processing instead of handing the op to another backend.
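To make the semantics of a triangular op concrete, here is a minimal plain-Python sketch. It is not the Hexagon kernel and not ggml's API: the function name `tri` and the mode strings are hypothetical, chosen only to mirror the four variants described above (strict or diagonal-inclusive, upper or lower) as we understand them.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]

# Hypothetical mode names for this sketch; each predicate decides
# whether the element at (row, col) is kept or replaced by zero.
KEEP: Dict[str, Callable[[int, int], bool]] = {
    "lower": lambda row, col: col < row,        # strictly below the diagonal
    "lower_diag": lambda row, col: col <= row,  # below, including the diagonal
    "upper": lambda row, col: col > row,        # strictly above the diagonal
    "upper_diag": lambda row, col: col >= row,  # above, including the diagonal
}


def tri(matrix: Matrix, mode: str) -> Matrix:
    """Return a copy of `matrix` with everything outside the chosen triangle zeroed."""
    keep = KEEP[mode]
    return [
        [value if keep(r, c) else 0.0 for c, value in enumerate(row)]
        for r, row in enumerate(matrix)
    ]


if __name__ == "__main__":
    m = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
    for row in tri(m, "lower_diag"):
        print(row)
    # [1.0, 0.0, 0.0]
    # [4.0, 5.0, 0.0]
    # [7.0, 8.0, 9.0]
```

A real backend implementation would do the same masking over tiles of a tensor using vector instructions; the sketch only shows which elements survive for each mode.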
Beyond the Hexagon enhancement, this release ships a broad set of platform builds: Apple platforms (macOS on Apple Silicon and Intel, plus iOS), Linux (x64, arm64, s390x, with Vulkan, ROCm 7.2, OpenVINO, SYCL FP32/FP16), Windows (x64, arm64, CUDA 12/13, Vulkan, SYCL, HIP), Android (arm64 CPU), and openEuler (x86 and aarch64 with various backends). The release also cleans up merge conflict artifacts and editor configuration errors in the Hexagon unary and ggml op code. This wide packaging makes llama.cpp easier to adopt for developers targeting diverse hardware accelerators.
- Hexagon TRI op added for Qualcomm HTP, so triangular matrix masking can run natively on edge AI hardware
- Release builds cover 12+ platform/backend combos, including CUDA 12/13, ROCm 7.2, Vulkan, SYCL, and more
- Collaborative contribution from Qualcomm engineers (Todor Boinovski, Max Krasnyansky) with verified GPG signature
Why It Matters
Extends llama.cpp's Hexagon backend with another natively supported op, helping keep more of an LLM's compute on the Qualcomm HTP for on-device inference on mobile and edge devices.