b8347
Latest commit patches critical tail corruption bug and optimizes matrix multiplication kernels for up to 10% speed gains.
The open-source powerhouse behind llama.cpp, ggml-org, has rolled out a significant technical update with commit b8347. This release primarily targets the Hexagon backend, which is crucial for running quantized AI models like Llama 3 on Qualcomm Snapdragon hardware, commonly found in smartphones and laptops. The core fix addresses a 'tail corruption' bug that occurred when processing matrix rows whose length is not a multiple of 256 elements in the Q4_0 and MXFP4 data types, a scenario that could silently corrupt model outputs. The team resolved this by switching the data repacking strategy for the final block of a row from an interleaved (0:128, 1:129) pattern to an even:odd (0:1, 2:3) pattern.
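As a rough illustration of the difference between the two pairings (the function names and tile sizes here are ours, not the actual Hexagon kernel code), packing pairs of 4-bit values into bytes can be sketched as:

```c
#include <stdint.h>

// Illustrative sketch only, not the real llama.cpp code. Two ways to pack a
// tile of n 4-bit values (one per input byte) into n/2 bytes of two nibbles.

// Interleaved pairing: element i goes with element i + n/2 (0:128, 1:129, ...).
// This assumes the full tile is present; applied to a short tail it would
// read bytes beyond the valid data, i.e. the 'tail corruption'.
static void pack_interleaved(const uint8_t *q, uint8_t *dst, int n) {
    for (int i = 0; i < n / 2; i++)
        dst[i] = (uint8_t)(q[i] | (q[i + n / 2] << 4));
}

// Even:odd pairing: adjacent elements go together (0:1, 2:3, ...), so a
// partial last block never reads past its own end.
static void pack_even_odd(const uint8_t *q, uint8_t *dst, int n) {
    for (int i = 0; i < n / 2; i++)
        dst[i] = (uint8_t)(q[2 * i] | (q[2 * i + 1] << 4));
}
```

The even:odd layout trades some SIMD-friendliness for safety: every byte it touches lies within the partial block itself.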
Beyond the critical bug fix, the commit delivers meaningful performance optimizations. The engineers refined the matrix multiplication (mm) kernels so that the new even:odd repacking logic applies only to the problematic last block, sparing the common full-block path from unnecessary data shuffling. Additional tweaks include optimizing the `rmpy_x8` kernel, tightening validation checks to prevent spurious failures, and using a more efficient instruction (`vzero`) to initialize accumulators. These low-level hardware optimizations can translate to faster and more stable inference for local LLMs on a growing ecosystem of AI-ready Snapdragon devices, from the latest smartphones to upcoming AI PCs.
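The "last block only" dispatch described above can be sketched as follows (a hypothetical, self-contained illustration; the function name, block size handling, and layout are simplified stand-ins for the actual Hexagon repack code):

```c
#include <stdint.h>

// Hypothetical sketch: repack one row of 4-bit values (one per input byte)
// into two-nibble bytes. Full 256-element blocks keep the fast interleaved
// layout; only a short final block falls back to even:odd pairing, so the
// common case pays no extra data-shuffling cost.
static void repack_row(const uint8_t *q, uint8_t *dst, int row_len) {
    const int BLK = 256;
    int off = 0;
    // full blocks: interleaved pairing (0:128, 1:129, ...)
    for (; off + BLK <= row_len; off += BLK)
        for (int i = 0; i < BLK / 2; i++)
            dst[off / 2 + i] = (uint8_t)(q[off + i] | (q[off + BLK / 2 + i] << 4));
    // partial tail (row_len % 256 != 0): even:odd pairing (0:1, 2:3, ...)
    // stays within the tail's own bytes instead of reading past the row
    int tail = row_len - off;
    for (int i = 0; i < tail / 2; i++)
        dst[off / 2 + i] = (uint8_t)(q[off + 2 * i] | (q[off + 2 * i + 1] << 4));
}
```

The design point is that correctness handling is confined to the tail: rows that are a multiple of 256 take exactly the same path as before the fix.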
- Fixes critical 'tail corruption' bug in Hexagon backend for Q4_0/MXFP4 models when row size ≠ multiple of 256.
- Optimizes matrix multiplication kernels (rmpy_x8) and repacking logic to avoid performance hits from data shuffling.
- Enhances stability and speed for local LLM inference on Qualcomm Snapdragon platforms (phones, laptops).
Why It Matters
This update makes local AI more reliable and efficient on billions of Qualcomm-powered devices, a key battleground for on-device AI.