llama.cpp b9388 adds MMVQ optimization for Turing GPUs
Fixes JIT compilation mismatch for Turing code on Ampere+ GPUs
The open-source llama.cpp project, widely used for running large language models locally, has released build b9388. The release centers on MMVQ, the CUDA kernels that multiply a quantized matrix by a vector (mul_mat_vec_q), and adds a dedicated parameter table, MMVQ_PARAMETERS_TURING, for the SM75 (Turing) architecture. The key fix addresses a JIT (just-in-time) compilation mismatch: code built for Turing GPUs (e.g., the RTX 20 series) can be JIT-compiled by the driver to run on Ampere or newer GPUs, and in that situation the wrong parameter table could end up in use, leading to suboptimal performance or errors. With this release, the appropriate parameter table is selected consistently across GPU generations.
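To make the failure mode concrete, here is a minimal host-only C++ sketch of how such a mismatch can arise in general. It is not llama.cpp's code: every name in it (`ParamTable`, `table_for_arch`, `highest_compiled_arch`, and so on) is hypothetical, and it assumes a binary that ships only SM75 code. The underlying CUDA fact is that `__CUDA_ARCH__` is fixed when the PTX is generated, so Turing PTX that the driver JIT-compiles for an Ampere GPU still behaves as if it were compiled for architecture 750, while host code that queries the device sees 860.

```cpp
// Illustrative model only; all names are hypothetical, not llama.cpp identifiers.
enum class ParamTable { Generic, Turing };

// Assumption: the binary was built with device code for SM75 only.
constexpr int kCompiledArchs[] = {750};

// Assumed mapping from an architecture number to a tuning table.
constexpr ParamTable table_for_arch(int arch) {
    return (arch >= 750 && arch < 800) ? ParamTable::Turing : ParamTable::Generic;
}

// Highest architecture the binary was built for that the GPU can run.
// Returns 0 if no compiled architecture fits.
constexpr int highest_compiled_arch(int runtime_cc) {
    int best = 0;
    for (int arch : kCompiledArchs) {
        if (arch <= runtime_cc && arch > best) {
            best = arch;
        }
    }
    return best;
}

// What the device code effectively uses: the table for the architecture
// it was compiled for, even when JIT-compiled to run on a newer GPU.
constexpr ParamTable device_table(int runtime_cc) {
    return table_for_arch(highest_compiled_arch(runtime_cc));
}

// Naive host-side choice: trusts the GPU's runtime compute capability.
constexpr ParamTable host_table_naive(int runtime_cc) {
    return table_for_arch(runtime_cc);
}

// Consistent host-side choice: uses the same architecture the device code saw.
constexpr ParamTable host_table_fixed(int runtime_cc) {
    return table_for_arch(highest_compiled_arch(runtime_cc));
}

// On an Ampere GPU (860) running Turing-built code, the naive host choice
// disagrees with the device, while the consistent choice agrees.
static_assert(host_table_naive(860) != device_table(860), "naive selection mismatches");
static_assert(host_table_fixed(860) == device_table(860), "fixed selection agrees");
```

In this model the two sides agree only when the host derives its table from the architecture the kernels were actually compiled for, not from the GPU that happens to be installed. Whether b9388 resolves the issue in exactly this way is not stated in the release summary.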
As with previous releases, this build is packaged for a wide range of platforms: Apple (macOS on Apple Silicon and Intel, plus iOS), Linux (x64, ARM, s390x, Vulkan, ROCm 7.2, OpenVINO, SYCL), Android (ARM), and Windows (CPU, CUDA 12/13, Vulkan, HIP). That coverage keeps local LLM inference available across a broad range of hardware. The release was made by Johannes Gäßler with contributions from Copilot and was signed with a verified GPG key.
- Added MMVQ_PARAMETERS_TURING for SM75 (Turing) GPUs to fix a JIT compilation mismatch when Turing code runs on Ampere or newer hardware
- Packaged for multiple backends (CUDA 12/13, ROCm 7.2, Vulkan, OpenVINO, SYCL, and HIP) across Windows, Linux, macOS, and Android
- Ensures correct parameter selection for matrix operations across GPU architectures, improving reliability and performance
Why It Matters
The fix keeps local LLM inference efficient across GPU generations, preserving accessibility for users with older hardware.