Developer Tools

llama.cpp b9330 fixes tensor bug, boosting Nemotron 120B speed by about 59%

A single tensor type fix unlocks 103 tokens per second on a 120B model.

Deep Dive

The open-source llama.cpp project, maintained by ggml-org, shipped a significant performance fix in release b9330. The bug lay in how the model loader's tensor info table declared the ffn_latent_down and ffn_latent_up tensors. Both were labeled GGML_OP_MUL (elementwise multiplication), but the Nemotron graph actually feeds them through ggml_mul_mat (matrix multiplication). When the loader decides where a weight should live, it runs a buffer-type ('buft') probe that asks each backend whether it supports the declared operation on that weight. Because the declaration said MUL rather than MUL_MAT, the probe asked the GPU the wrong question and got false where the correct question would have returned true. The loader therefore placed the weight on the CPU, and the matrix multiplication followed it there. That split the computational graph between devices and sharply reduced throughput.
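The placement logic described above can be illustrated with a toy model. This is a minimal sketch, not llama.cpp code: the Backend class, the place_weight function, and the support table are all hypothetical, and the assumption that the GPU accepts MUL_MAT on quantized weights while accepting elementwise MUL only on f32 is made purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Op(Enum):
    MUL = auto()      # elementwise multiply
    MUL_MAT = auto()  # matrix multiply


@dataclass(frozen=True)
class Backend:
    name: str
    supported: frozenset  # (Op, weight_type) pairs this toy backend accepts

    def supports(self, op, weight_type):
        return (op, weight_type) in self.supported


# Assumption for illustration: the GPU multiplies quantized matrices,
# but only performs elementwise MUL on f32 tensors.
GPU = Backend("GPU", frozenset({
    (Op.MUL_MAT, "q5_K"),
    (Op.MUL_MAT, "f32"),
    (Op.MUL, "f32"),
}))

# The CPU fallback accepts every combination in this toy model.
CPU = Backend("CPU", frozenset((op, t) for op in Op for t in ("q5_K", "f32")))


def place_weight(declared_op, weight_type, backends=(GPU, CPU)):
    """Return the name of the first backend whose probe accepts the declared op."""
    for backend in backends:
        if backend.supports(declared_op, weight_type):
            return backend.name
    raise ValueError("no backend supports this tensor")


print(place_weight(Op.MUL, "q5_K"))      # mislabeled declaration -> CPU
print(place_weight(Op.MUL_MAT, "q5_K"))  # corrected declaration  -> GPU
```

The weight itself is identical in both calls. Only the declared operation changes, and that alone decides whether the probe keeps the tensor on the GPU.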

With the fix, the tensors are tagged GGML_OP_MUL_MAT, so the probe now asks the GPU backend whether it can run a matrix multiplication on the weight. The weights stay on the GPU and the graph is no longer split across devices. Benchmarked on Nemotron 3 Super 120B at Q5_K_M quantization, throughput rose from 64.9 tokens per second to 103.22 t/s, an improvement of about 59%. The release also ships the usual cross-platform binaries: macOS (Apple Silicon, with optional KleidiAI, and Intel), iOS, Linux (CPU, Vulkan, ROCm, OpenVINO, SYCL), Android, and Windows (CPU, CUDA, Vulkan, HIP, SYCL). Anyone running Nemotron models locally with GPU offload has a clear reason to upgrade.

Key Points
  • Performance jump from 64.9 to 103.22 t/s on Nemotron 3 Super 120B Q5_K_M.
  • Bug caused by ffn_latent tensors mislabeled as elementwise MUL instead of matrix multiply MUL_MAT.
  • Fix ensures GPU backend correctly assigns weights, eliminating CPU graph splits.

Why It Matters

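A single mislabeled tensor operation was costing local inference users more than a third of their throughput on Nemotron 3 Super 120B. Correcting the label recovers that speed with no change to the model or the hardware.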