Developer Tools

llama.cpp b9810 adds a cublasSgemmBatched mapping for HIP and MUSA GPU backends

New release lets batched matrix multiplication calls in the CUDA backend build for HIP (AMD) and MUSA (Moore Threads) targets

Deep Dive

The ggml-org team released llama.cpp version b9810 on June 26, 2024. The update adds a cublasSgemmBatched mapping to the HIP/MUSA vendor headers. HIP (Heterogeneous-computing Interface for Portability) is AMD's CUDA-like programming interface, and MUSA is the comparable compute platform for Moore Threads GPUs. llama.cpp reuses its CUDA backend on both by mapping CUDA and cuBLAS names to each vendor's equivalents in these headers. Batched SGEMM (Single-precision GEneral Matrix Multiply) performs many independent matrix multiplications in a single call, a pattern that recurs throughout transformer-based LLM inference. With the mapping in place, CUDA-backend code that calls cublasSgemmBatched can also target HIP and MUSA builds. This may help narrow the gap with NVIDIA GPUs for running models like LLaMA locally, although no benchmark figures are given for the change.
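To make the batched SGEMM operation concrete, here is a minimal pure-Python sketch of what such a call computes: for every matrix triple in the batch, C = alpha * (A x B) + beta * C. The function names `sgemm` and `sgemm_batched` are hypothetical and chosen for illustration; they are not llama.cpp or cuBLAS APIs. The sketch also assumes row-major nested lists, whereas the real GPU routine works on column-major device buffers passed as arrays of pointers.

```python
def sgemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for row-major nested lists (illustrative only)."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]


def sgemm_batched(alpha, a_batch, b_batch, beta, c_batch):
    """Apply the same GEMM to every (A, B, C) triple in the batch.

    On a GPU the per-matrix products run in parallel inside one library call;
    this reference version simply loops over the batch.
    """
    if not (len(a_batch) == len(b_batch) == len(c_batch)):
        raise ValueError("batch sizes must match")
    return [sgemm(alpha, a, b, beta, c)
            for a, b, c in zip(a_batch, b_batch, c_batch)]


# Two independent 2x2 products handled as one batch.
a_batch = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]]]
b_batch = [[[5.0, 6.0], [7.0, 8.0]], [[2.0, 3.0], [4.0, 5.0]]]
c_batch = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]

result = sgemm_batched(1.0, a_batch, b_batch, 0.0, c_batch)
print(result)
```

Grouping many small products into one call is what makes the batched form attractive on GPUs: launch overhead is paid once rather than once per matrix.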

The release also includes prebuilt binaries for a wide range of platforms: macOS Apple Silicon (with optional KleidiAI acceleration), macOS Intel, Linux on x64/arm64/s390x with CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL; Windows x64 and arm64 with CPU, OpenCL Adreno, CUDA 12/13, Vulkan, OpenVINO, SYCL, and HIP; plus Android arm64 CPU and openEuler builds. The commit is signed and verified. This broad support reinforces llama.cpp's role as the go-to tool for developer-run, local AI inference across diverse hardware ecosystems.

Key Points
  • New cublasSgemmBatched mapping in the HIP/MUSA vendor headers lets the CUDA backend's batched SGEMM calls build for AMD (HIP) and Moore Threads (MUSA) GPUs
  • Prebuilt binaries for macOS, Linux, Windows, Android, and more, including ROCm, Vulkan, and SYCL support
  • The release commit is signed and verified, which helps users confirm its origin

Why It Matters

Broadens what llama.cpp's CUDA backend can do on AMD and Moore Threads GPUs, giving local LLM users more hardware choices beyond NVIDIA.
