b8333
The latest commit eliminates a major performance regression in Qwen3.5-9B on Apple Silicon.
The open-source project llama.cpp, maintained by ggml-org, has released a significant performance update with commit b8333. This patch addresses a critical 39% performance regression discovered when running the Qwen3.5-9B model on Apple's M4 Max GPU using the Metal backend. The regression stemmed from inefficient memory access patterns in the fused Gated Delta Net (GDN) kernel, where the state matrix was being accessed column-wise on row-major storage, causing strided reads that wasted GPU cache bandwidth.
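To make the stride concrete: in row-major storage, element (i, j) of a matrix of width S_v lives at offset i*S_v + j, so walking down a column jumps a full row per step while walking along a row advances one element. A minimal sketch of that arithmetic (S_v = 128 is an assumption inferred from the 512-byte stride quoted below, not a value stated in the commit):

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    // Assumption: S_v = 128, inferred from 512 bytes / 4 bytes per f32.
    const size_t S_v = 128;

    // Row-major layout: element (i, j) sits at offset i*S_v + j.
    // Column-wise traversal (i -> i+1, j fixed) skips a whole row per step:
    printf("column step: %zu bytes\n", S_v * sizeof(float)); // 512 bytes
    // Row-wise traversal (j -> j+1, i fixed) touches adjacent memory:
    printf("row step:    %zu bytes\n", sizeof(float));       // 4 bytes
    return 0;
}
```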
The fix transposes the state indexing so that threads read contiguously, turning strided access into coalesced reads. On Metal, this changed the access from `s_ptr[is*S_v]` to `s_ptr[is]`, shrinking the read stride from 512 bytes to a single 4-byte f32 element. The update also introduces a new `--fused-gdn [on|off|auto]` command-line flag, giving users direct control over the fused GDN path independently of auto-detection, in the spirit of the existing `--flash-attn` flag; pinning the path explicitly keeps fused and unfused execution compatible by preventing state layout mismatches between them.
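In kernel terms, the change looks roughly like the sketch below; `s_ptr`, `is`, and `S_v` are taken from the commit description, while the surrounding structure is an assumption rather than the actual Metal source:

```cpp
// Illustrative sketch, not the real kernel: `is` plays the role of the
// per-thread lane index within a simdgroup/warp.

// Before: lane `is` indexes column-wise, so neighboring lanes hit addresses
// S_v floats (512 bytes) apart and each cache line serves only one lane.
inline float load_state_before(const float * s_ptr, int is, int S_v) {
    return s_ptr[is * S_v];
}

// After: the state indexing is transposed, so neighboring lanes read
// neighboring floats and the loads coalesce into full cache-line reads.
inline float load_state_after(const float * s_ptr, int is) {
    return s_ptr[is];
}
```

On the user-facing side, an invocation like `llama-cli -m model.gguf --fused-gdn off` would force the unfused path; the flag and its on|off|auto values come from the commit, while the rest of the command line is just a typical llama.cpp invocation.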
Additionally, the CPU implementation received SIMD optimizations, with `ggml_vec_dot_f32` replacing scalar inner loops for the dot products in the fused GDN kernel. The commit was a collaborative effort by multiple developers and, notably, was co-authored by Claude Opus 4.6, highlighting the growing role of AI assistants in code optimization. The changes pass all `GATED_DELTA_NET` backend-ops tests and are available across all supported platforms, including macOS, Linux, Windows, and openEuler.
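As a rough picture of that CPU-side change, here is the kind of scalar loop being replaced and the vectorized call it gives way to; the exact `ggml_vec_dot_f32` call shape is assumed from recent ggml sources rather than quoted from the commit:

```cpp
#include <cstddef>
#include <cstdio>

// Scalar inner loop of the kind the commit replaces (illustrative only):
static float dot_scalar(const float * x, const float * y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

int main() {
    const float x[4] = {1, 2, 3, 4};
    const float y[4] = {4, 3, 2, 1};
    printf("scalar dot: %f\n", dot_scalar(x, y, 4));

    // Inside ggml, the same reduction now goes through ggml_vec_dot_f32,
    // which dispatches to platform SIMD (e.g. NEON on Apple Silicon)
    // instead of multiplying one element at a time. Assumed call shape:
    //
    //   float sum;
    //   ggml_vec_dot_f32(n, &sum, 0, x, 0, y, 0, 1);
    return 0;
}
```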
- Fixes 39% performance regression for Qwen3.5-9B on Apple M4 Max Metal backend
- Optimizes fused GDN kernel for coalesced memory reads across Metal, CUDA, and CPU
- Adds a new `--fused-gdn` CLI flag for user control and prevents state layout mismatches
Why It Matters
This optimization directly improves inference speed for developers running large language models locally, particularly on Apple Silicon hardware.