Developer Tools

llama.cpp b9193 fixes embedding normalization in server mode

No more hard-coded default: llama-server now honors the --embd-normalize flag.

Deep Dive

ggml-org has released llama.cpp b9193, a patch release that fixes a bug in embedding normalization in server mode. The --embd-normalize flag was previously registered only for the embedding and debug examples, so llama-server rejected it on the command line and always fell back to a hard-coded default of 2 (L2 normalization). The fix adds LLAMA_EXAMPLE_SERVER to the flag's example set and makes the embeddings handler read params.embd_normalize as its default. Per-request overrides via the embd_normalize body field continue to work and take precedence over that default.
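The precedence described above (server-wide default, overridden by a per-request field) can be sketched in a few lines of Python. This is only an illustration of the described behavior, not llama.cpp's actual handler code: the helper names resolve_embd_normalize and build_request_body are hypothetical, and the "input" key used for the text is an assumption about the request shape. Only the embd_normalize field name comes from the release notes.

```python
import json


def resolve_embd_normalize(server_default, request_body):
    """Hypothetical helper: the per-request value wins, otherwise
    the server-wide default (set via --embd-normalize) is used."""
    return request_body.get("embd_normalize", server_default)


def build_request_body(text, embd_normalize=None):
    """Hypothetical helper: build a JSON body for an embeddings request.

    The "input" key is an assumption for illustration; "embd_normalize"
    is the per-request override field named in the release notes.
    """
    body = {"input": text}
    if embd_normalize is not None:
        body["embd_normalize"] = embd_normalize
    return json.dumps(body)


# A request without the field falls back to the server default.
plain = json.loads(build_request_body("hello"))
# A request with the field overrides the server default.
override = json.loads(build_request_body("hello", embd_normalize=2))
```

Before b9193 the value playing the role of server_default was fixed at 2; after the fix it comes from the command-line flag.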

The release ships pre-built binaries for a wide range of platforms: macOS (Apple Silicon, Intel, iOS XCFramework), Linux (x64/arm64 CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL), Windows (x64/arm64 CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android (arm64 CPU), and openEuler (x86/aarch64 with ACL Graph). With the fix, llama-server can be started with the same --embd-normalize setting used with the embedding example, so vectors produced by the two paths are normalized the same way. That makes llama.cpp more dependable for RAG pipelines and similarity search.

Key Points
  • llama-server now honors the --embd-normalize CLI argument instead of hard-coding L2 normalization
  • Fix applies to all platforms: macOS, Linux, Windows, Android, iOS, and openEuler
  • Per-request embd_normalize body field still overrides the server default

Why It Matters

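The fix lets operators set the embedding normalization mode for llama-server once, on the command line, instead of being locked to L2 normalization or having to pass an override with every request. This matters for similarity search and RAG applications because similarity scores are only comparable when every vector in the index and every query vector is normalized the same way.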
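A small worked example shows why mixing normalization settings can distort retrieval. The vectors below are made up for illustration and are not output from any model. The query is unit length, doc_a points in almost the same direction as the query, and doc_b has a larger magnitude but a different direction.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Illustrative vectors only.
query = [1.0, 0.0]
doc_a = [0.9, 0.1]   # nearly parallel to the query, small magnitude
doc_b = [3.0, 4.0]   # less aligned with the query, large magnitude

# Dot-product scores when the documents were stored without normalization.
raw_scores = {"a": dot(query, doc_a), "b": dot(query, doc_b)}

# Dot-product scores when every vector is L2-normalized (equals cosine similarity).
unit_scores = {
    "a": dot(query, l2_normalize(doc_a)),
    "b": dot(query, l2_normalize(doc_b)),
}

raw_best = max(raw_scores, key=raw_scores.get)
unit_best = max(unit_scores, key=unit_scores.get)
```

With unnormalized documents the larger-magnitude doc_b scores highest, while with consistent L2 normalization the better-aligned doc_a does. The ranking flips purely because of the normalization setting.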