llama.cpp b9810 adds a cublasSgemmBatched mapping for HIP and MUSA builds
New release lets the CUDA backend's batched matrix multiplication call resolve on HIP (AMD) and MUSA (Moore Threads) GPUs
The ggml-org team released llama.cpp version b9810 on June 26, 2024. The update adds a cublasSgemmBatched mapping to the HIP/MUSA vendor headers, which alias CUDA and cuBLAS names to their vendor equivalents so that the CUDA backend code can be compiled for other GPUs. HIP (Heterogeneous-Compute Interface for Portability) is AMD's CUDA-like programming interface, and MUSA is the corresponding platform for Moore Threads GPUs. Batched SGEMM (Single-precision GEneral Matrix Multiply) runs many independent small matrix multiplications in one call, a pattern that appears throughout transformer-based LLM inference. With the mapping in place, code paths that call cublasSgemmBatched can build on these backends; the release notes summarized here do not quantify any performance change, so the effect on the gap with NVIDIA GPUs for running models like LLaMA locally remains to be measured.
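To make the batched SGEMM operation concrete, here is a minimal reference sketch of what such a call computes: for every matrix triple in the batch, C[i] = alpha * A[i] * B[i] + beta * C[i]. This is an illustration only, not llama.cpp or cuBLAS code. The function name `sgemm_batched` is hypothetical, matrices are plain row-major Python lists, and the transpose flags, leading dimensions, and column-major layout of the real cuBLAS interface are deliberately left out.

```python
def sgemm_batched(alpha, a_batch, b_batch, beta, c_batch):
    """Reference semantics of a batched GEMM (hypothetical helper).

    For each index i in the batch, returns
        alpha * (A[i] @ B[i]) + beta * C[i]
    Matrices are row-major lists of lists. A GPU library would run the
    batch entries in parallel; this loop only shows the arithmetic.
    """
    results = []
    for a, b, c in zip(a_batch, b_batch, c_batch):
        rows, inner, cols = len(a), len(b), len(b[0])
        results.append([
            [
                alpha * sum(a[r][t] * b[t][col] for t in range(inner))
                + beta * c[r][col]
                for col in range(cols)
            ]
            for r in range(rows)
        ])
    return results


# Two independent 2x2 products handled in one call.
a_batch = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]]]
b_batch = [[[5.0, 6.0], [7.0, 8.0]], [[2.0, 3.0], [4.0, 5.0]]]
c_batch = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]

print(sgemm_batched(1.0, a_batch, b_batch, 0.0, c_batch))
# → [[[19.0, 22.0], [43.0, 50.0]], [[2.0, 3.0], [4.0, 5.0]]]
```

The point of batching is that the per-call launch overhead is paid once for the whole list of small products instead of once per product, which is why the operation suits the many small multiplications in attention layers.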
The release also includes prebuilt binaries for a wide range of platforms: macOS Apple Silicon (with optional KleidiAI acceleration), macOS Intel, Linux on x64/arm64/s390x with CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL; Windows x64 and arm64 with CPU, OpenCL Adreno, CUDA 12/13, Vulkan, OpenVINO, SYCL, and HIP; plus Android arm64 CPU and openEuler builds. The commit is signed and verified. This broad support reinforces llama.cpp's role as the go-to tool for developer-run, local AI inference across diverse hardware ecosystems.
- New cublasSgemmBatched mapping in the HIP/MUSA vendor headers, so batched SGEMM calls in the CUDA backend resolve on AMD (HIP) and Moore Threads (MUSA) GPUs
- Prebuilt binaries for macOS, Linux, Windows, Android, and more, including ROCm, Vulkan, and SYCL support
- Release commit is signed and verified, which lets users confirm that it comes from ggml-org
Why It Matters
Extends the CUDA backend's coverage on non-NVIDIA GPUs through HIP and MUSA, giving people who run LLMs locally more hardware choices and less dependence on NVIDIA.