b8853
A key fix for unaligned vocab sizes now lets models like HY-MT load without crashing.
The open-source project llama.cpp, maintained by ggml-org, has patched a significant bug in its latest commit (b8853). The issue was in the SYCL backend's reorder mul_mat_vec_q dispatchers for quantized models (Q4_0, Q8_0, Q4_K, Q6_K), which contained a hard assert requiring the workgroup count (block_num_y) to be a multiple of 16 subgroups. As a result, models whose vocabulary size is not divisible by 16, such as the HY-MT model with 120,818 tokens, crashed immediately on load when the output projection layer was processed.
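For illustration only, here is a minimal C++ sketch of the pre-fix behaviour; `dispatch_mul_mat_vec_q` and `SUBGROUP_WIDTH` are hypothetical stand-ins, not the actual llama.cpp code. An unaligned row count such as HY-MT's 120,818 leaves a remainder of 2 and trips the hard assert before any work is launched.

```cpp
#include <cassert>
#include <cstdio>

constexpr int SUBGROUP_WIDTH = 16;  // the hardcoded '16' described above

// Hypothetical stand-in for the pre-fix dispatch logic: the assert aborts
// whenever the row count does not split evenly into 16-wide subgroups.
void dispatch_mul_mat_vec_q(int nrows) {
    assert(nrows % SUBGROUP_WIDTH == 0 && "row count must be a multiple of the subgroup size");
    const int block_num_y = nrows / SUBGROUP_WIDTH;  // truncating division
    std::printf("launching %d workgroups for %d rows\n", block_num_y, nrows);
}

int main() {
    dispatch_mul_mat_vec_q(120832);  // multiple of 16: launches 7552 workgroups
    dispatch_mul_mat_vec_q(120818);  // HY-MT vocab size: 120818 % 16 == 2, assert fires
}
```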
The fix, developed with AI-assisted coding and tested on Intel B70 hardware, replaces the hard assert with padding: block_num_y now rounds up to the nearest whole number of subgroup-sized workgroups, and the extra padded threads exit early through the existing row bounds checks. This preserves performance for aligned models while allowing previously incompatible ones to run. A secondary change replaced the hardcoded '16' with the `WARP_SIZE` constant, making the code easier to maintain across hardware targets. The commit, which credits contributors @arthw and @NeoZhangJianyu, resolves GitHub issue #22020 and continues the effort to broaden llama.cpp's multi-platform support, from Apple Silicon and CUDA to SYCL and Vulkan.
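A minimal sketch of the idea behind the fix, again with hypothetical names (`padded_block_num_y`, `kernel_row`) rather than the commit's actual functions: the host rounds the workgroup count up with a ceiling division, and a per-row bounds check lets the padded threads return without touching memory.

```cpp
#include <cstdio>

constexpr int WARP_SIZE = 16;  // subgroup width, now a named constant instead of a literal

// Host side: round the workgroup count up instead of requiring exact alignment.
int padded_block_num_y(int nrows) {
    return (nrows + WARP_SIZE - 1) / WARP_SIZE;  // ceiling division
}

// Device-side idea of the guard: a thread mapped past the last row returns
// immediately, so the padded workgroups do no work and access no memory.
void kernel_row(int row, int nrows /*, quantized weights, activations, ... */) {
    if (row >= nrows) {
        return;  // padded thread exits early via the row bounds check
    }
    // ... per-row quantized dot product (mul_mat_vec_q) would go here ...
}

int main() {
    // 120,818 rows now map to 7,552 workgroups instead of tripping an assert.
    std::printf("block_num_y = %d\n", padded_block_num_y(120818));
}
```

Rounding up costs at most one extra workgroup per launch, which is why aligned models see no performance change.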
- Fixes a crash for models with vocab sizes not divisible by 16, like HY-MT (120,818 tokens).
- Patches SYCL backend dispatchers for Q4_0, Q8_0, Q4_K, and Q6_K quantized models on Intel hardware.
- Replaces a hard assert with safe padding, ensuring no performance regression for aligned models.
Why It Matters
Removes a barrier to running cutting-edge open-source models locally, expanding the practical ecosystem for AI developers.