llama.cpp b9221 adds PAD op with Hexagon HVX kernels
New vectorized padding supports zero-padding and circular padding across all 4 tensor dimensions.
The latest release of llama.cpp (b9221) adds a PAD kernel for Qualcomm's Hexagon HTP (Hexagon Tensor Processor) backend, implemented with HVX (Hexagon Vector eXtensions) vector instructions. The change, pull request #23078, implements GGML_OP_PAD with support for both zero-padding and circular padding across all four tensor dimensions. Padding is a common building block in neural networks, used where a layer needs its input dimensions aligned or its shape extended.
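To make the two padding modes concrete, here is a minimal scalar reference sketch in Python. It is not the HVX kernel from the pull request and does not use the ggml API: the function name `pad_4d`, its parameters, and the choice of separate left and right pad amounts per dimension are assumptions made for illustration. The only layout assumption is a flat buffer in which dimension 0 varies fastest.

```python
def pad_4d(src, ne, lp, rp, circular=False):
    """Pad a flat 4-D tensor (hypothetical helper, not a ggml function).

    src      -- flat list of values, dimension 0 varies fastest
    ne       -- extents of the four dimensions, each > 0
    lp, rp   -- left/right pad amounts per dimension
    circular -- False: fill padding with zeros; True: wrap around
    """
    # Output extent per dimension = left pad + original extent + right pad.
    ne_out = [lp[d] + ne[d] + rp[d] for d in range(4)]
    dst = [0.0] * (ne_out[0] * ne_out[1] * ne_out[2] * ne_out[3])

    for i3 in range(ne_out[3]):
        for i2 in range(ne_out[2]):
            for i1 in range(ne_out[1]):
                for i0 in range(ne_out[0]):
                    out_idx = (i0, i1, i2, i3)
                    # Coordinates relative to the start of the source data.
                    rel = [out_idx[d] - lp[d] for d in range(4)]
                    if circular:
                        # Wrap each coordinate back into the source range.
                        rel = [rel[d] % ne[d] for d in range(4)]
                    elif any(r < 0 or r >= n for r, n in zip(rel, ne)):
                        continue  # padding region: element stays zero
                    s = rel[0] + ne[0] * (rel[1] + ne[1] * (rel[2] + ne[2] * rel[3]))
                    o = i0 + ne_out[0] * (i1 + ne_out[1] * (i2 + ne_out[2] * i3))
                    dst[o] = src[s]
    return dst, ne_out


# One row of three values, padded by 1 on the left and 2 on the right of dim 0.
row = [1.0, 2.0, 3.0]
zero_padded, _ = pad_4d(row, (3, 1, 1, 1), (1, 0, 0, 0), (2, 0, 0, 0))
wrapped, _ = pad_4d(row, (3, 1, 1, 1), (1, 0, 0, 0), (2, 0, 0, 0), circular=True)
print(zero_padded)  # [0.0, 1.0, 2.0, 3.0, 0.0, 0.0]
print(wrapped)      # [3.0, 1.0, 2.0, 3.0, 1.0, 2.0]
```

A vectorized kernel would process many elements of a row per instruction rather than one at a time, but the mapping from output index to source index is the same idea as in this sketch.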
This addition matters most for deploying large language models on edge devices that use Qualcomm's Hexagon DSP (Digital Signal Processor): with PAD implemented on the accelerator, tensors can be padded there directly instead of falling back to the CPU. The release also cleans up duplicate op cases and fixes macro alignment. Binary releases are available for multiple platforms, including macOS (Apple Silicon and Intel), Linux (x64, ARM64, with Vulkan, ROCm, OpenVINO, and SYCL support), Android (ARM64), Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP), and openEuler with Ascend NPU support.
- New GGML_OP_PAD HVX kernel for Hexagon HTP backend (PR #23078)
- Supports zero-padding and circular padding across all 4 tensor dimensions
- Part of the llama.cpp b9221 release, which ships multi-platform binary builds
Why It Matters
Running PAD on the Hexagon backend removes a CPU fallback for models that use the op, which helps make LLM inference on Qualcomm Hexagon processors more efficient for on-device, edge AI workloads.