Developer Tools

b9066

New release delivers faster batched LLM inference on NVIDIA GPUs by routing the batched out_prod operation through cuBLAS

Deep Dive

ggml-org's llama.cpp, the popular open-source C/C++ library for running large language models locally, has published release b9066. This update focuses on a performance-critical CUDA optimization: the inner loop of the batched out_prod operation now calls cublasSgemmStridedBatched in place of the earlier, less efficient approach. The out_prod (outer product) operation appears in attention and feed-forward computations, where issuing many small matrix products as a single batched call can significantly accelerate inference.
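
To make the mechanism concrete, the sketch below (not the actual llama.cpp code; sizes and variable names are invented for the example) computes a batch of outer products C_i = a_i * b_i^T with a single cublasSgemmStridedBatched call by treating each outer product as a GEMM with k = 1:

    // Illustrative only: a batch of outer products as one strided-batched GEMM.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int m = 4, n = 3, batch = 2;              // toy sizes for illustration

        // Host data: per batch element, a vector a (length m) and b (length n).
        std::vector<float> a(batch * m), b(batch * n), c(batch * m * n);
        for (int i = 0; i < batch * m; ++i) a[i] = float(i + 1);
        for (int i = 0; i < batch * n; ++i) b[i] = float(i + 1);

        float *da, *db, *dc;
        cudaMalloc((void **) &da, a.size() * sizeof(float));
        cudaMalloc((void **) &db, b.size() * sizeof(float));
        cudaMalloc((void **) &dc, c.size() * sizeof(float));
        cudaMemcpy(da, a.data(), a.size() * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(db, b.data(), b.size() * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // An outer product is a GEMM with k = 1: C (m x n) = a (m x 1) * b^T (1 x n).
        // One strided-batched call replaces a loop of per-batch GEMM launches.
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemmStridedBatched(handle,
            CUBLAS_OP_N, CUBLAS_OP_N,
            m, n, /*k=*/1,
            &alpha,
            da, /*lda=*/m, /*strideA=*/m,               // a_i: column-major m x 1
            db, /*ldb=*/1, /*strideB=*/n,               // b_i^T: column-major 1 x n
            &beta,
            dc, /*ldc=*/m, /*strideC=*/(long long) m * n,
            batch);

        cudaMemcpy(c.data(), dc, c.size() * sizeof(float), cudaMemcpyDeviceToHost);
        printf("C[0](0,0) = %.1f\n", c[0]);             // expect a[0] * b[0] = 1.0

        cublasDestroy(handle);
        cudaFree(da); cudaFree(db); cudaFree(dc);
        return 0;
    }

Built with nvcc and linked against cuBLAS (nvcc outer_prod.cu -lcublas), the single strided-batched call avoids launching one GEMM per batch element, which is the general motivation for this kind of batching on small per-item matrices.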

The change, contributed in PR #22651, also adds cublasSgemmStridedBatched mappings for HIP (AMD ROCm) and MUSA (Moore Threads) backends, extending the benefit beyond NVIDIA GPUs. The release includes pre-built binaries for over 30 platforms, from Windows x64 (with CUDA 12/13 DLLs) to macOS Apple Silicon, Linux ARM64, and even openEuler. This makes high-performance local LLM inference accessible on a wider range of hardware, especially for power users running large models like Llama 3 or Mixtral.
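
Backends such as hipBLAS expose a cuBLAS-style API, so mappings like these can often be thin name aliases. The fragment below is only a hedged illustration of that pattern: the guard macros are ggml's usual GGML_USE_HIP / GGML_USE_MUSA, but the header paths and the MUSA symbol name are assumptions rather than the release's actual code:

    // Hedged sketch of vendor aliasing, not the actual llama.cpp vendor headers.
    #if defined(GGML_USE_HIP)
    #include <hipblas/hipblas.h>
    #define cublasSgemmStridedBatched hipblasSgemmStridedBatched
    #elif defined(GGML_USE_MUSA)
    #include <mublas.h>                                          // assumed header name
    #define cublasSgemmStridedBatched mublasSgemmStridedBatched  // assumed symbol name
    #else
    #include <cublas_v2.h>
    #endif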

Key Points
  • b9066 adds a cublasSgemmStridedBatched-based inner loop for batched out_prod on CUDA, speeding up the underlying matrix multiplications
  • Extends support to HIP (AMD) and MUSA (Moore Threads) backends for cross-platform GPU acceleration
  • Available pre-built for 30+ targets including Windows, macOS, Linux, Android, and iOS

Why It Matters

Local LLM inference gets a meaningful speed boost for batched workloads, enabling faster responses on consumer GPUs.