llama.cpp b9275 optimizes Metal concat kernel for Apple Silicon GPU performance

llama.cpp Releases May 22, 2026

⚡ New release boosts GPU occupancy for narrow tensors with row batching.

Deep Dive

The b9275 release of llama.cpp delivers targeted Metal GPU optimizations for Apple Silicon. The concat kernel now batches multiple rows into a single threadgroup when the tensor width (ne0) is less than 256. Each threadgroup is dispatched with up to 256 threads, and the kernel computes a rows-per-threadgroup count (nrptg) so that several narrow rows share one threadgroup instead of each row occupying a mostly idle one. This raises GPU occupancy on narrow tensors, which in turn should help inference throughput. The release also fixes the thread count used for the GGML_OP_SET kernel so that it executes correctly.
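The row-batching arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the code from the release: only ne0, nrptg, and the 256-thread limit come from the article, while the function name, the one-thread-per-element mapping, and the integer-division rounding are assumptions. The actual kernel may, for example, round nrptg to a power of two.

```python
MAX_THREADS_PER_GROUP = 256  # limit quoted in the article


def plan_concat_dispatch(ne0, nrows, max_threads=MAX_THREADS_PER_GROUP):
    """Hypothetical helper: return (threads_per_row, nrptg, num_threadgroups).

    ne0   -- row width in elements
    nrows -- total number of rows to process
    """
    if ne0 <= 0 or nrows <= 0:
        raise ValueError("ne0 and nrows must be positive")

    # Assumption: one thread per element, capped at the threadgroup limit.
    threads_per_row = min(ne0, max_threads)

    # Batch rows only when a single row cannot fill the threadgroup.
    if ne0 < max_threads:
        nrptg = max(1, max_threads // threads_per_row)
    else:
        nrptg = 1

    # Never batch more rows than actually exist.
    nrptg = min(nrptg, nrows)

    # Ceiling division: enough threadgroups to cover every row.
    num_threadgroups = (nrows + nrptg - 1) // nrptg
    return threads_per_row, nrptg, num_threadgroups


if __name__ == "__main__":
    for ne0, nrows in [(64, 1000), (4096, 10)]:
        print(ne0, nrows, plan_concat_dispatch(ne0, nrows))
```

Under these assumptions, a tensor with ne0 = 64 packs four rows into each 256-thread group, so 1000 rows need 250 threadgroups rather than 1000. Rows of width 256 or more keep one row per group.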

Alongside the performance work, the release expands test coverage for copy operations (CPY). The test suite now includes 50 new reshaping test cases covering 1D-to-4D conversions, boundary conditions at 1024 elements, and both small and large changes in dimensionality. The tests are also refactored to use dimension permutations over {3, 5, 7, 32}, exercising the F32, F16, and Q4_0 data types. These changes benefit developers using llama.cpp for on-device AI on macOS, iOS, and other Metal-supported platforms.
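To make the permutation idea concrete, the sketch below enumerates source shapes from the dimension set {3, 5, 7, 32} and pairs each with a reshaped target holding the same number of elements. It is illustrative only and does not reproduce the release's test list or its count of 50. The function name, the choice of target shape, and the rule that Q4_0 cases need a first dimension divisible by its 32-value block size are all assumptions of this sketch.

```python
from itertools import permutations

DIMS = (3, 5, 7, 32)            # dimension set named in the article
TYPES = ("F32", "F16", "Q4_0")  # data types named in the article
Q4_0_BLOCK = 32                 # Q4_0 packs values in blocks of 32


def reshape_cases():
    """Hypothetical generator of (dtype, src_shape, dst_shape) cases."""
    cases = []
    for dtype in TYPES:
        for src in permutations(DIMS):
            # Assumption: quantized rows must be a whole number of blocks.
            if dtype == "Q4_0" and src[0] % Q4_0_BLOCK != 0:
                continue
            # Flatten the three outer dimensions into one, keeping ne0,
            # so source and target hold the same number of elements.
            dst = (src[0], src[1] * src[2] * src[3], 1, 1)
            cases.append((dtype, src, dst))
    return cases


if __name__ == "__main__":
    for dtype, src, dst in reshape_cases()[:5]:
        print(dtype, src, "->", dst)
```

Every generated pair holds 3 × 5 × 7 × 32 = 3360 elements. Mixing small odd sizes with a block-aligned 32 is one plausible way to hit both unaligned and aligned row widths.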

Key Points

Metal concat kernel optimized with row batching for tensors with ne0 < 256, improving GPU occupancy.
Fixed GGML_OP_SET kernel thread allocation to prevent incorrect execution.
Added 50 new test cases for CPY reshaping operations across multiple data types (F32, F16, Q4_0).

Why It Matters

Faster local LLM inference on Apple GPUs means smoother on-device AI for professionals using llama.cpp.
