Developer Tools

b8331

The b8331 release patches a dimension mismatch that caused `ggml_mul_mat` assertions to fail.

Deep Dive

The open-source project llama.cpp, maintained by ggml-org, has released a critical patch (b8331) to fix a crash bug in its embedding model pipeline. The issue was a regression introduced by a previous commit (#20340) that affected the chunked fused Gated Delta Net (GDN) detection within the scheduler. Specifically, a dimension mismatch occurred in the `build_pooling()` function when processing embedding models configured with mean or rank pooling. This mismatch caused the underlying ggml tensor operation `ggml_mul_mat` to fail with an assertion error, crashing the process.
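The failure mode can be pictured as a shape-compatibility precondition on the matrix multiply. The sketch below is illustrative, not llama.cpp's actual code: it models the check that two tensors entering a mul-mat must share the same inner dimension, which is what broke when one operand was sized from `n_seqs` and the other from `n_tokens`.

```cpp
#include <array>

// Illustrative precondition (names hypothetical): a matrix multiply can
// only proceed when both operands share the same inner dimension, ne[0].
// ggml_mul_mat asserts on this; a mismatch crashes the process.
using Shape = std::array<long, 2>; // {ne0, ne1}

inline bool dims_compatible(const Shape& a, const Shape& b) {
    return a[0] == b[0]; // shared inner dimension must match
}
```

With `n_seqs = 4`, tensors sized from `n_seqs` (4) and from `n_tokens` (64) fail this check, which is the assertion the patch eliminates.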

The bug manifested because the graph reservation for chunked GDN detection incorrectly passed `n_seqs` as the `n_outputs` parameter, while the mean pooling operation `build_inp_mean()` created a tensor with a shape based on `n_tokens` (which equals `16 * n_seqs`). This shape mismatch led to the failed matrix multiplication. The fix, which aligns with the pattern used in other worst-case reservation paths, passes `n_tokens` as `n_outputs` to ensure compatible dimensions.
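The fix amounts to changing one argument in the reservation path. The sketch below uses hypothetical names (the real code lives in llama.cpp's scheduler) to show the relationship between `n_seqs`, `n_tokens`, and the corrected `n_outputs` described above.

```cpp
// Hypothetical sketch of the worst-case graph reservation for chunked GDN
// detection; struct and function names are illustrative, not llama.cpp's API.
struct ReserveParams {
    long n_tokens;
    long n_seqs;
    long n_outputs; // must match the shape produced by build_inp_mean()
};

inline ReserveParams make_gdn_reserve(long n_seqs) {
    const long n_tokens = 16 * n_seqs; // sizing per the release notes
    // Before the patch: n_outputs = n_seqs  -> shape mismatch in pooling.
    // After the patch:  n_outputs = n_tokens, as in other reservation paths.
    return { n_tokens, n_seqs, n_tokens };
}
```

Keeping `n_outputs` tied to `n_tokens` guarantees the pooling input and the reserved graph agree on dimensions, so the multiplication's shape check passes.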

To prevent future regressions, the release also adds new test cases (`test_embedding_pooling_mean` and `test_embedding_pooling_mean_multiple`) to the embedding test suite. These tests directly cover the previously untested `--pooling mean` code path. The update is available across all major platforms, including macOS (Apple Silicon and Intel), Linux (CPU, Vulkan, ROCm), and Windows (CPU, CUDA, Vulkan, SYCL, HIP).
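For context, the `--pooling mean` path that the new tests exercise reduces per-token embeddings to a single vector per sequence by averaging. A minimal sketch of that arithmetic (not llama.cpp's implementation):

```cpp
#include <vector>
#include <cstddef>

// Mean pooling: average the embedding vectors of all tokens in a sequence
// into one fixed-size sequence embedding.
std::vector<float> mean_pool(const std::vector<std::vector<float>>& tokens) {
    std::vector<float> out(tokens[0].size(), 0.0f);
    for (const auto& tok : tokens)
        for (std::size_t i = 0; i < tok.size(); ++i)
            out[i] += tok[i];
    for (float& v : out)
        v /= static_cast<float>(tokens.size());
    return out;
}
```

Because every token contributes to the output, this path multiplies the hidden states by a mask shaped on the full token count, which is exactly why the `n_tokens`/`n_seqs` mix-up surfaced here first.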

Key Points
  • Fixes a regression from commit d28961d (#20340) that caused a crash in embedding models using mean/rank pooling.
  • Resolves a dimension mismatch in `build_pooling()` where `n_outputs` was incorrectly set to `n_seqs` instead of `n_tokens`.
  • Adds new test suite coverage for the `--pooling mean` code path to catch similar bugs in the future.

Why It Matters

Ensures stability for production deployments using llama.cpp's embedding features, which are critical for RAG and semantic search applications.