Developer Tools

b8331

The b8331 release patches a dimension mismatch that caused `ggml_mul_mat` assertions to fail.

Deep Dive

The open-source project llama.cpp, maintained by ggml-org, has released a critical patch (b8331) to fix a crash bug in its embedding model pipeline. The issue was a regression introduced by a previous commit (#20340) that affected the chunked fused Gated Delta Net (GDN) detection within the scheduler. Specifically, a dimension mismatch occurred in the `build_pooling()` function when processing embedding models configured with mean or rank pooling. This mismatch caused the underlying ggml tensor operation `ggml_mul_mat` to fail with an assertion error, crashing the process.
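The failure mode can be pictured as a shape-compatibility precondition on the matrix multiply. The sketch below is illustrative, not llama.cpp's actual code: it models the check that two tensors entering a mul-mat must share the same inner dimension, which is what broke when one operand was sized from `n_seqs` and the other from `n_tokens`.

```cpp
#include <array>

// Illustrative precondition (names hypothetical): a matrix multiply can
// only proceed when both operands share the same inner dimension, ne[0].
// ggml_mul_mat asserts on this; a mismatch crashes the process.
using Shape = std::array<long, 2>; // {ne0, ne1}

inline bool dims_compatible(const Shape& a, const Shape& b) {
    return a[0] == b[0]; // shared inner dimension must match
}
```

With `n_seqs = 4`, tensors sized from `n_seqs` (4) and from `n_tokens` (64) fail this check, which is the assertion the patch eliminates.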

The bug manifested because the graph reservation for chunked GDN detection incorrectly passed `n_seqs` as the `n_outputs` parameter, while the mean pooling operation `build_inp_mean()` created a tensor with a shape based on `n_tokens` (which equals `16 * n_seqs`). This shape mismatch led to the failed matrix multiplication. The fix, which aligns with the pattern used in other worst-case reservation paths, passes `n_tokens` as `n_outputs` to ensure compatible dimensions.
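The fix amounts to changing one argument in the reservation path. The sketch below uses hypothetical names (the real code lives in llama.cpp's scheduler) to show the relationship between `n_seqs`, `n_tokens`, and the corrected `n_outputs` described above.

```cpp
// Hypothetical sketch of the worst-case graph reservation for chunked GDN
// detection; struct and function names are illustrative, not llama.cpp's API.
struct ReserveParams {
    long n_tokens;
    long n_seqs;
    long n_outputs; // must match the shape produced by build_inp_mean()
};

inline ReserveParams make_gdn_reserve(long n_seqs) {
    const long n_tokens = 16 * n_seqs; // sizing per the release notes
    // Before the patch: n_outputs = n_seqs  -> shape mismatch in pooling.
    // After the patch:  n_outputs = n_tokens, as in other reservation paths.
    return { n_tokens, n_seqs, n_tokens };
}
```

Keeping `n_outputs` tied to `n_tokens` guarantees the pooling input and the reserved graph agree on dimensions, so the multiplication's shape check passes.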

To prevent future regressions, the release also adds new test cases (`test_embedding_pooling_mean` and `test_embedding_pooling_mean_multiple`) to the embedding test suite. These tests directly cover the previously untested `--pooling mean` code path. The update is available across all major platforms, including macOS (Apple Silicon and Intel), Linux (CPU, Vulkan, ROCm), and Windows (CPU, CUDA, Vulkan, SYCL, HIP).
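For context, the `--pooling mean` path that the new tests exercise reduces per-token embeddings to a single vector per sequence by averaging. A minimal sketch of that arithmetic (not llama.cpp's implementation):

```cpp
#include <vector>
#include <cstddef>

// Mean pooling: average the embedding vectors of all tokens in a sequence
// into one fixed-size sequence embedding.
std::vector<float> mean_pool(const std::vector<std::vector<float>>& tokens) {
    std::vector<float> out(tokens[0].size(), 0.0f);
    for (const auto& tok : tokens)
        for (std::size_t i = 0; i < tok.size(); ++i)
            out[i] += tok[i];
    for (float& v : out)
        v /= static_cast<float>(tokens.size());
    return out;
}
```

Because every token contributes to the output, this path multiplies the hidden states by a mask shaped on the full token count, which is exactly why the `n_tokens`/`n_seqs` mix-up surfaced here first.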

Key Points
  • Fixes a regression from commit d28961d (#20340) that caused a crash in embedding models using mean/rank pooling.
  • Resolves a dimension mismatch in `build_pooling()` where `n_outputs` was incorrectly set to `n_seqs` instead of `n_tokens`.
  • Adds new test suite coverage for the `--pooling mean` code path to catch similar bugs in the future.

Why It Matters

Ensures stability for production deployments using llama.cpp's embedding features, which are critical for RAG and semantic search applications.