Developer Tools

b8400

The latest llama.cpp update enables key Qwen 3.5 attention layers to run natively on Snapdragon chips.

Deep Dive

The open-source project llama.cpp, maintained by ggml-org, has released a significant update (commit b8400) that brings specialized hardware acceleration to mobile AI. The core addition is native support for four elementwise unary operations (negation, exponential, sigmoid, and softplus) on Qualcomm's Hexagon Digital Signal Processor (DSP). These specific ops are critical components of Qwen 3.5's DeltaNet linear attention mechanism. By implementing them directly with Hexagon Vector eXtensions (HVX) intrinsics, the update lets these computationally intensive layers run efficiently on the dedicated DSP hardware in Snapdragon chips, rather than falling back to slower CPU or GPU paths.
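To pin down what these four ops actually compute, here is a minimal scalar sketch. This is not the commit's code: the real kernels vectorize these functions with HVX intrinsics, and the stable-softplus cutoff below is an illustrative choice, not a value taken from b8400.

```cpp
#include <cmath>

// Scalar reference semantics for the four unary ops added in b8400.
// The actual Hexagon kernels apply the same math across HVX vector registers.
static inline float op_neg(float x)     { return -x; }
static inline float op_exp(float x)     { return std::exp(x); }
static inline float op_sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }
static inline float op_softplus(float x) {
    // softplus(x) = log(1 + e^x), computed stably: for large x,
    // softplus(x) ~= x, which avoids overflow in exp. The 20.0f
    // threshold is illustrative, not from the commit.
    return x > 20.0f ? x : std::log1p(std::exp(x));
}
```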

This optimization follows the project's existing pattern for unary operations and uses DMA double-buffering through the Hexagon's Vector Tightly-Coupled Memory (VTCM) for memory efficiency. The commit also includes general improvements: a `CONT` (make contiguous) operation that reuses the existing copy infrastructure, and a multi-threaded `REPEAT` operation for tiled memory copies. The impact is direct: models like Qwen 3.5, which previously fell back to slower paths for these ops on mobile devices, can now run them on the Hexagon DSP. This translates to faster token generation, lower power consumption, and more viable on-device execution for state-of-the-art language models, furthering the shift of AI inference away from the cloud.
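The double-buffering idea itself is simple: while the DSP computes on one tile resident in VTCM, the DMA engine fills the other tile. Below is a hedged sketch of that control flow in plain C++, not the actual ggml Hexagon backend; `dma_fill` is a hypothetical stand-in for the real user-DMA start/wait calls (stubbed here with `memcpy` so the sketch runs), and the tile size is illustrative.

```cpp
#include <cstring>
#include <cstddef>

// Ping-pong tiles; on real hardware these would live in VTCM.
constexpr int TILE = 1024;          // floats per tile (illustrative)
static float tile_buf[2][TILE];

// Placeholder standing in for a DMA start+wait pair; real code would
// program the Hexagon user-DMA engine so the transfer overlaps compute.
static void dma_fill(float *dst_tile, const float *src) {
    std::memcpy(dst_tile, src, TILE * sizeof(float));
}

// Apply a unary op over n_tiles * TILE elements with double buffering:
// while tile t is being processed, tile t+1 is (conceptually) in flight.
void unary_tiled(const float *src, float *dst, int n_tiles, float (*op)(float)) {
    dma_fill(tile_buf[0], src);                   // prefetch tile 0
    for (int t = 0; t < n_tiles; ++t) {
        const int cur = t & 1;
        if (t + 1 < n_tiles)                      // start fetching the next tile
            dma_fill(tile_buf[cur ^ 1], src + (std::size_t)(t + 1) * TILE);
        for (int i = 0; i < TILE; ++i)            // compute on the resident tile
            dst[(std::size_t)t * TILE + i] = op(tile_buf[cur][i]);
    }
}
```

With an asynchronous DMA engine, the transfer for tile t+1 proceeds concurrently with the vector work on tile t, hiding memory latency behind compute.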

Key Points
  • Adds native Hexagon DSP support for four key ops (neg, exp, sigmoid, softplus) needed by Qwen 3.5's DeltaNet.
  • Uses Qualcomm HVX intrinsics and VTCM DMA double-buffering for optimized performance on Snapdragon mobile chips.
  • Enables more efficient on-device inference for advanced models, reducing reliance on cloud APIs and improving speed/power.

Why It Matters

Unlocks faster, more power-efficient execution of cutting-edge models like Qwen 3.5 directly on smartphones, advancing on-device AI.