b8334
New commit resolves a KV cache error that crashed Hellaswag and Winogrande evaluations, restoring reliable benchmark scoring.
The open-source project llama.cpp, maintained by ggml-org, has pushed a significant update with commit b8334. The release addresses a persistent technical hurdle in evaluating language models on certain multiple-choice reasoning benchmarks: it enables KVU (unified KV cache) mode during perplexity calculations for tasks like Hellaswag and Winogrande. Previously, running these evaluations without the `-kvu` flag triggered a 'split_equal' error, as the system failed to find memory slots for the batches and `llama_decode()` aborted. This prevented models from being scored on these important benchmarks at all.
The commit, authored by Adrien Gallouët of Hugging Face, is a backend improvement that enhances the toolkit's reliability for researchers and developers. It ensures that evaluations run with `llama-perplexity` (for example, against the small `unsloth/Qwen3-0.6B-GGUF` model) complete successfully. The fix matters for the open-source AI community that relies on llama.cpp for efficient, cross-platform inference (supporting macOS, Linux, Windows, and even iOS). By stabilizing the evaluation pipeline, it allows more consistent benchmarking and comparison of model capabilities on tasks designed to measure commonsense and causal reasoning.
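For readers who want to reproduce such a run, a sketch of the kind of command involved is below. The model and data file names are placeholders, not values from the commit; `--hellaswag`, `--hellaswag-tasks`, `-f`, and `-kvu` are existing `llama-perplexity` options.

```shell
# Hypothetical invocation sketch: model and task-data filenames are examples.
# Before commit b8334, a run like this without -kvu could fail with a
# 'split_equal' error while searching for KV cache slots in llama_decode();
# the fix enables unified KV cache mode for these benchmark tasks.
llama-perplexity -m Qwen3-0.6B-Q8_0.gguf \
    --hellaswag -f hellaswag_val.txt --hellaswag-tasks 400
```

The `--winogrande` and `--winogrande-tasks` options work analogously for the Winogrande benchmark.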
- Commit b8334 fixes a 'split_equal' error in llama.cpp that crashed perplexity scoring for Hellaswag and Winogrande tasks.
- The solution enables KVU (unified KV cache) mode to properly handle coupled sequences in input batches during decoding.
- This backend update ensures reliable evaluation of models like the 0.6B parameter Qwen3 on standardized reasoning benchmarks.
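Conceptually, these benchmarks present a shared context followed by several candidate endings, and the winner is the ending the model finds most likely; this is why the evaluator batches coupled sequences that share a KV cache prefix. A minimal, self-contained sketch of that scoring step (toy log-probabilities, not llama.cpp's actual implementation) looks like:

```python
def pick_ending(ending_logprobs: list[list[float]]) -> int:
    """Choose the ending with the highest length-normalized log-likelihood.

    ending_logprobs[i] holds the per-token log-probabilities the model
    assigned to candidate ending i, conditioned on the shared context.
    Normalizing by token count keeps longer endings from being penalized
    simply for having more tokens.
    """
    scores = [sum(lp) / len(lp) for lp in ending_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: four candidate endings with made-up per-token log-probs.
endings = [
    [-2.1, -3.0, -1.5],        # ending 0
    [-0.4, -0.6],              # ending 1: most likely per token
    [-1.9, -2.2, -2.8, -3.1],  # ending 2
    [-2.5, -2.5],              # ending 3
]
print(pick_ending(endings))  # → 1
```

An evaluation harness compares the chosen index against the gold label and reports the fraction correct; the llama.cpp fix ensures the decoding pass that produces these log-probabilities no longer crashes.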
Why It Matters
This fix ensures reliable, apples-to-apples benchmarking for the open-source AI community, crucial for tracking model progress on reasoning tasks.