b8853
A key fix for unaligned vocab sizes now lets models like HY-MT load without crashing.
The open-source project llama.cpp, maintained by ggml-org, has patched a significant bug in its latest commit (b8853). The issue was in the SYCL backend's reorder mul_mat_vec_q dispatchers for quantized models (Q4_0, Q8_0, Q4_K, Q6_K), which contained a hard assert requiring the workgroup count (block_num_y) to be a multiple of 16 subgroups. As a result, models whose vocabulary size is not divisible by 16, such as the HY-MT model with 120,818 tokens, crashed immediately on load when the output projection layer was processed.
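For illustration only, here is a minimal C++ sketch of the pre-fix behaviour; `dispatch_mul_mat_vec_q` and `SUBGROUP_WIDTH` are hypothetical stand-ins, not the actual llama.cpp code. An unaligned row count such as HY-MT's 120,818 leaves a remainder of 2 and trips the hard assert before any work is launched.

```cpp
#include <cassert>
#include <cstdio>

constexpr int SUBGROUP_WIDTH = 16;  // the hardcoded '16' described above

// Hypothetical stand-in for the pre-fix dispatch logic: the assert aborts
// whenever the row count does not split evenly into 16-wide subgroups.
void dispatch_mul_mat_vec_q(int nrows) {
    assert(nrows % SUBGROUP_WIDTH == 0 && "row count must be a multiple of the subgroup size");
    const int block_num_y = nrows / SUBGROUP_WIDTH;  // truncating division
    std::printf("launching %d workgroups for %d rows\n", block_num_y, nrows);
}

int main() {
    dispatch_mul_mat_vec_q(120832);  // multiple of 16: launches 7552 workgroups
    dispatch_mul_mat_vec_q(120818);  // HY-MT vocab size: 120818 % 16 == 2, assert fires
}
```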
The fix, developed with AI-assisted coding and tested on Intel B70 hardware, replaces the hard assert with padding: block_num_y now rounds up to the nearest whole number of subgroup-sized workgroups, and the extra padded threads exit early through the existing row bounds checks. This preserves performance for aligned models while allowing previously incompatible ones to run. A secondary change replaced the hardcoded '16' with the `WARP_SIZE` constant, making the code easier to maintain across hardware targets. The commit, which credits contributors @arthw and @NeoZhangJianyu, resolves GitHub issue #22020 and continues the effort to broaden llama.cpp's multi-platform support, from Apple Silicon and CUDA to SYCL and Vulkan.
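A minimal sketch of the idea behind the fix, again with hypothetical names (`padded_block_num_y`, `kernel_row`) rather than the commit's actual functions: the host rounds the workgroup count up with a ceiling division, and a per-row bounds check lets the padded threads return without touching memory.

```cpp
#include <cstdio>

constexpr int WARP_SIZE = 16;  // subgroup width, now a named constant instead of a literal

// Host side: round the workgroup count up instead of requiring exact alignment.
int padded_block_num_y(int nrows) {
    return (nrows + WARP_SIZE - 1) / WARP_SIZE;  // ceiling division
}

// Device-side idea of the guard: a thread mapped past the last row returns
// immediately, so the padded workgroups do no work and access no memory.
void kernel_row(int row, int nrows /*, quantized weights, activations, ... */) {
    if (row >= nrows) {
        return;  // padded thread exits early via the row bounds check
    }
    // ... per-row quantized dot product (mul_mat_vec_q) would go here ...
}

int main() {
    // 120,818 rows now map to 7,552 workgroups instead of tripping an assert.
    std::printf("block_num_y = %d\n", padded_block_num_y(120818));
}
```

Rounding up costs at most one extra workgroup per launch, which is why aligned models see no performance change.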
- Fixes a crash for models with vocab sizes not divisible by 16, like HY-MT (120,818 tokens).
- Patches SYCL backend dispatchers for Q4_0, Q8_0, Q4_K, and Q6_K quantized models on Intel hardware.
- Replaces a hard assert with safe padding, ensuring no performance regression for aligned models.
Why It Matters
Removes a barrier to running cutting-edge open-source models locally, expanding the practical ecosystem for AI developers.