Developer Tools

llama.cpp b9510 speeds up a key WASM inference kernel 3.4x with SIMD128 optimization

New release boosts q4_1_q8_1 dot product performance on WebAssembly by 3.4x.

Deep Dive

The llama.cpp project released version b9510, which includes a significant performance optimization for WebAssembly (WASM) builds. The core change vectorizes the inner loop of ggml_vec_dot_q4_1_q8_1 using WASM SIMD128 intrinsics: the 32 4-bit weights of a block are unpacked into two u8x16 registers, widened to i16, and accumulated with dot product instructions. Benchmarks on Node.js v25, built with emcc -O3 -msimd128, show a 3.42x speedup over the scalar reference (257.8 ns/call vs 880.7 ns/call). Correctness was verified across 10 random seeds with exact output match. The SIMD implementation was moved to a dedicated WASM backend file (ggml-cpu/arch/wasm/quants.c) to follow the architecture-specific layout, while the generic fallback remains in quants.c for non-WASM targets.
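For orientation, the scalar computation that the SIMD path replaces can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's code: the struct names and the function name are hypothetical, the scales are plain float rather than ggml's half-precision fields, and the nibble order (low nibbles hold values 0-15, high nibbles hold values 16-31) is an assumption about the q4_1 layout.

```c
#include <assert.h>
#include <stdint.h>

#define QK 32  /* values per quantization block */

/* Hypothetical, simplified block layouts (float scales instead of fp16). */
typedef struct {
    float   d;           /* scale */
    float   m;           /* minimum (offset added to every weight) */
    uint8_t qs[QK / 2];  /* 32 unsigned 4-bit weights, two per byte */
} block_q4_1_sketch;

typedef struct {
    float  d;       /* scale */
    float  s;       /* d * sum(qs), precomputed at quantization time */
    int8_t qs[QK];  /* 32 signed 8-bit activations */
} block_q8_1_sketch;

/* Scalar dot product over nb blocks.
 * With weight w_i = d4*q_i + m4 and activation a_i = d8*p_i:
 *   sum(w_i * a_i) = d4*d8*sum(q_i*p_i) + m4*(d8*sum(p_i))
 *                  = d4*d8*sumi         + m4*s8
 */
float vec_dot_q4_1_q8_1_scalar(int nb,
                               const block_q4_1_sketch *x,
                               const block_q8_1_sketch *y) {
    assert(nb >= 0);
    float sumf = 0.0f;
    for (int ib = 0; ib < nb; ++ib) {
        int32_t sumi = 0;
        for (int j = 0; j < QK / 2; ++j) {
            const int lo = x[ib].qs[j] & 0x0F;  /* value j      (assumed order) */
            const int hi = x[ib].qs[j] >> 4;    /* value j + 16 (assumed order) */
            sumi += lo * y[ib].qs[j] + hi * y[ib].qs[j + QK / 2];
        }
        sumf += (x[ib].d * y[ib].d) * (float)sumi + x[ib].m * y[ib].s;
    }
    return sumf;
}
```

The inner loop over j is the part the release vectorizes; the per-block floating-point combination at the end stays a handful of scalar operations.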

This optimization matters to developers building AI applications that run LLMs in the browser or on edge devices via WebAssembly. The q4_1_q8_1 dot product is a common kernel for quantized models (4-bit weights, 8-bit activations). A 3.4x faster kernel on WASM should mean faster token generation, lower latency for chat interfaces, and more responsive local AI tools without requiring native binaries, although the end-to-end gain depends on how much of the total runtime this kernel accounts for. The change is transparent to users: existing llama.cpp clients on WASM will benefit automatically after upgrading. The release also extends llama.cpp's support for diverse deployment scenarios, from desktop GPUs to lightweight WebAssembly environments.

Key Points
  • 3.42x speedup over scalar for q4_1_q8_1 dot product on WASM SIMD128 (257.8 ns/call vs 880.7 ns/call)
  • Implementation uses wasm_v128_load, AND/SHR to unpack nibbles, and 4x wasm_i32x4_dot_i16x8 for accumulation
  • SIMD code moved to a dedicated WASM backend file (ggml-cpu/arch/wasm/quants.c); correctness verified across 10 random seeds
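The data flow in the key points (unpack nibbles, widen to 16 bits, four dot steps) can be illustrated with a portable emulation that runs without a WASM toolchain. This is a sketch, not the released kernel: the function names are hypothetical, dot_i16x8_emul only mimics the lane semantics of wasm_i32x4_dot_i16x8 (multiply eight signed 16-bit pairs, add adjacent products into four 32-bit lanes), and the way the four dot steps are split across the low-nibble and high-nibble registers is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for the i32x4 dot-of-i16x8 operation:
 * out[k] = a[2k]*b[2k] + a[2k+1]*b[2k+1], for k = 0..3. */
static void dot_i16x8_emul(const int16_t a[8], const int16_t b[8], int32_t out[4]) {
    for (int k = 0; k < 4; ++k) {
        out[k] = (int32_t)a[2 * k] * b[2 * k]
               + (int32_t)a[2 * k + 1] * b[2 * k + 1];
    }
}

/* Integer part of one block: 16 packed bytes (32 nibbles) against
 * 32 signed 8-bit activations, computed in a SIMD-like order. */
int32_t block_sumi_simd_style(const uint8_t q4[16], const int8_t q8[32]) {
    assert(q4 && q8);

    /* One 16-byte load, then AND / SHR to split into two 16-lane registers. */
    uint8_t lo[16], hi[16];
    for (int j = 0; j < 16; ++j) {
        lo[j] = q4[j] & 0x0F;  /* low nibbles:  values 0..15  (assumed order) */
        hi[j] = q4[j] >> 4;    /* high nibbles: values 16..31 (assumed order) */
    }

    int32_t acc[4] = {0, 0, 0, 0};

    /* Four dot steps: each consumes 8 widened weights and 8 widened activations. */
    for (int step = 0; step < 4; ++step) {
        const uint8_t *w = (step < 2) ? lo : hi;            /* weight register   */
        const int8_t  *a = q8 + ((step < 2) ? 0 : 16);      /* matching 16 acts  */
        const int off = (step & 1) * 8;                     /* low or high half  */

        int16_t w16[8], a16[8];
        int32_t part[4];
        for (int k = 0; k < 8; ++k) {
            w16[k] = (int16_t)w[off + k];  /* zero-extend unsigned nibble */
            a16[k] = (int16_t)a[off + k];  /* sign-extend int8 activation */
        }
        dot_i16x8_emul(w16, a16, part);
        for (int k = 0; k < 4; ++k) acc[k] += part[k];
    }

    /* Horizontal sum of the four 32-bit lanes. */
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Widening to 16 bits before multiplying keeps every product (at most 15 * 128 in magnitude) well inside the 32-bit accumulator lanes, which is why the integer result can match a scalar loop exactly.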

Why It Matters

Faster WASM inference enables smoother local LLM experiences in browsers and edge devices without native dependencies.