Developer Tools

llama.cpp b9128 boosts Hexagon DSP performance with new HVX helpers

New release cuts power consumption and speeds up local LLM inference on Qualcomm chips.

Deep Dive

llama.cpp's b9128 release from ggml-org introduces several Hexagon DSP optimizations: HVX vector replication helpers (including a new hvx_vec_repl_2x_f16 helper), elimination of scalar VTCM loads, optimized per-group scale handling, slope loads from VTCM, and aligned memory access. The release provides builds for macOS, Linux, Android, Windows, and openEuler.
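To make the "aligned memory access" item concrete, here is a minimal portable C sketch of keeping a buffer aligned to a vector width. It is not code from llama.cpp. The 128-byte width matches HVX's 128-byte vector mode, and the names VEC_BYTES, round_up_vec, alloc_vec_aligned, is_vec_aligned, and vec_count are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: 128-byte vectors, as in HVX's 128-byte mode. */
#define VEC_BYTES 128

/* Round n up to the next multiple of VEC_BYTES. */
static size_t round_up_vec(size_t n) {
    return (n + VEC_BYTES - 1) / VEC_BYTES * VEC_BYTES;
}

/* Number of whole vectors in an already padded length. */
static size_t vec_count(size_t padded) {
    assert(padded % VEC_BYTES == 0);
    return padded / VEC_BYTES;
}

/* Allocate a zeroed buffer whose start address and length are both
   multiples of VEC_BYTES, so every vector-sized chunk is aligned.
   Returns NULL on failure; the caller releases it with free(). */
static void *alloc_vec_aligned(size_t n) {
    size_t padded = round_up_vec(n);
    if (padded == 0) {
        padded = VEC_BYTES;  /* avoid a zero-sized request */
    }
    void *p = aligned_alloc(VEC_BYTES, padded);  /* C11 */
    if (p != NULL) {
        memset(p, 0, padded);
    }
    return p;
}

/* Check whether a pointer sits on a vector boundary. */
static int is_vec_aligned(const void *p) {
    return ((uintptr_t)p % VEC_BYTES) == 0;
}
```

Padding the length as well as aligning the start means a loop can always process whole vectors, with no partial chunk left over at the end of the buffer.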

Key Points
  • New hvx_vec_repl helpers replace scalar VTCM loads with vector splat operations, cutting memory latency.
  • hmx-mm per-group scale handling is optimized for faster LLM weight dequantization on the Hexagon DSP.
  • Builds are available for macOS, Linux, Windows, Android, and openEuler, including CUDA 12/13 and Vulkan backends.
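The first two points share one general idea: broadcast a per-group scale across a vector once, then reuse it for every weight in the group. The sketch below models that in portable C. It is an illustration under stated assumptions, not the Hexagon code in the release. The vecf type stands in for a vector register, and LANES, GROUP_SIZE, vec_splat, and dequant_groups are hypothetical names. The int8 weight format is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 32       /* hypothetical vector width, in float lanes */
#define GROUP_SIZE 32  /* hypothetical quantization group size */

/* Stand-in for a hardware vector register. */
typedef struct {
    float lane[LANES];
} vecf;

/* Splat: copy one scalar into every lane of a vector. */
static vecf vec_splat(float x) {
    vecf v;
    for (int i = 0; i < LANES; i++) {
        v.lane[i] = x;
    }
    return v;
}

/* Dequantize n int8 weights that share one float scale per group of
   GROUP_SIZE elements. The scale is read and splatted once per group,
   rather than being re-read as a scalar for every element. */
static void dequant_groups(const int8_t *q, const float *scales,
                           float *out, size_t n) {
    assert(n % GROUP_SIZE == 0);
    for (size_t g = 0; g < n / GROUP_SIZE; g++) {
        vecf s = vec_splat(scales[g]);
        for (size_t i = 0; i < GROUP_SIZE; i++) {
            size_t idx = g * GROUP_SIZE + i;
            out[idx] = (float)q[idx] * s.lane[i % LANES];
        }
    }
}
```

On real vector hardware the inner loop would be a single lane-wise multiply, so the cost of fetching the scale is paid once per group instead of once per weight.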

Why It Matters

The release makes local LLM inference on Qualcomm-powered phones and edge devices faster and more power-efficient, which supports private, on-device AI use.