llama.cpp b9388 adds MMVQ optimization for Turing GPUs

llama.cpp Releases May 29, 2026

⚡Fixes JIT compilation mismatch for Turing code on Ampere+ GPUs

Deep Dive

The open-source llama.cpp project, widely used for running large language models locally, has released version b9388. The release centers on MMVQ, the project's kernels for multiplying a quantized matrix by a vector (mul_mat_vec_q), and adds MMVQ_PARAMETERS_TURING, a parameter table for the SM75 (Turing) GPU architecture. The key fix addresses a JIT (just-in-time) compilation mismatch: previously, code built for Turing GPUs (e.g., the RTX 20 series) and then JIT-compiled to run on Ampere or newer GPUs could be handled with mismatched parameters, leading to suboptimal performance or errors. With the dedicated Turing table, the appropriate parameters are selected consistently across GPU generations.
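To make the mismatch concrete, here is a minimal sketch of how a host-side table choice can disagree with a kernel that was built for SM75 and JIT-compiled for a newer GPU. It is not llama.cpp's actual code: every name (`ParamTable`, `table_for`, `host_table_naive`, `host_table_consistent`) is hypothetical, and the encoding of compute capability as major * 100 + minor * 10 is an assumption made for illustration.

```cpp
// Illustrative sketch only: all names below are hypothetical, not llama.cpp's API.
enum class ParamTable { Generic, Turing, AmpereOrNewer };

// Assumed encoding: compute capability as major * 100 + minor * 10 (SM75 -> 750).
constexpr int CC_TURING = 750;
constexpr int CC_AMPERE = 800;

// Map an architecture to a parameter table.
constexpr ParamTable table_for(int cc) {
    return cc >= CC_AMPERE ? ParamTable::AmpereOrNewer
         : cc >= CC_TURING ? ParamTable::Turing
         : ParamTable::Generic;
}

// Device code is specialized at compile time, so a kernel built for SM75 and
// JIT-compiled for a newer GPU still uses the table of its compiled arch.
constexpr ParamTable device_table(int compiled_cc) {
    return table_for(compiled_cc);
}

// Mismatch-prone host logic: consults only the GPU present at runtime.
constexpr ParamTable host_table_naive(int runtime_cc) {
    return table_for(runtime_cc);
}

// Consistent host logic: never assumes a newer arch than the kernel was built for.
constexpr ParamTable host_table_consistent(int compiled_cc, int runtime_cc) {
    return table_for(compiled_cc < runtime_cc ? compiled_cc : runtime_cc);
}

// SM75 build running on an SM86 GPU: the naive host choice disagrees with the
// device side, while the consistent choice agrees with it.
static_assert(host_table_naive(860) != device_table(750), "mismatch case");
static_assert(host_table_consistent(750, 860) == device_table(750), "consistent case");
```

The point of the sketch is that the host and the device must derive their parameters from the same architecture value; whether b9388 achieves this in exactly this way is not stated in the release summary.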

As with previous releases, this version is packaged for a wide range of platforms: Apple (macOS on Apple Silicon and Intel, iOS), Linux (x64, ARM, s390x, Vulkan, ROCm 7.2, OpenVINO, SYCL), Android (ARM), and Windows (CPU, CUDA 12/13, Vulkan, HIP). Users can therefore keep running local LLM inference efficiently across this range of hardware. The release was made by Johannes Gäßler with contributions from Copilot and was signed with a verified GPG key.

Key Points

Adds MMVQ_PARAMETERS_TURING for SM75 (Turing) GPUs, fixing a JIT compilation mismatch when Turing code runs on Ampere or newer hardware
Supports multiple backends: CUDA 12/13, ROCm 7.2, Vulkan, OpenVINO, SYCL, and HIP across Windows, Linux, macOS, and Android
Ensures correct parameter selection for quantized matrix operations across GPU architectures, improving reliability and performance

Why It Matters

Maintains local LLM inference efficiency across GPU generations, preserving accessibility for users with older hardware.
