llama.cpp b9275 optimizes Metal concat kernel for Apple Silicon GPU performance
New release boosts GPU occupancy for narrow tensors with row batching.
The b9275 release of llama.cpp delivers targeted Metal GPU optimizations for Apple Silicon. The concat kernel now batches multiple rows into a single threadgroup when the tensor width (ne0) is less than 256. By dispatching up to 256 threads per group and calculating rows per threadgroup (nrptg), the kernel avoids underutilizing the GPU on narrow tensors, improving occupancy and inference throughput. Additionally, the GGML_OP_SET kernel thread count issue has been fixed to ensure correct parallel execution.
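The row-batching idea can be illustrated with a small host-side calculation. This is a minimal sketch, not the code from the release: the `ConcatDispatch` struct, the `plan_concat_dispatch` function, and the exact rounding policy (integer division of the 256-thread cap by `ne0`) are assumptions made for illustration. Only the names `ne0` and `nrptg` and the 256-thread limit come from the release description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical description of one kernel dispatch (not a ggml type).
struct ConcatDispatch {
    int64_t threads_per_row;  // threads assigned to a single row
    int64_t nrptg;            // rows handled by one threadgroup
    int64_t n_threadgroups;   // threadgroups needed to cover all rows
};

// Assumed policy: at most max_threads threads per threadgroup; rows narrower
// than that cap are batched so one threadgroup covers several rows.
inline ConcatDispatch plan_concat_dispatch(int64_t ne0, int64_t nrows,
                                           int64_t max_threads = 256) {
    assert(ne0 > 0 && nrows > 0 && max_threads > 0);

    ConcatDispatch d;
    d.threads_per_row = std::min(ne0, max_threads);

    // Narrow rows: fit as many whole rows as possible under the thread cap.
    // Wide rows: one row per threadgroup.
    d.nrptg = ne0 < max_threads ? std::max<int64_t>(1, max_threads / ne0) : 1;

    // Round up so a final, partially filled threadgroup covers leftover rows.
    d.n_threadgroups = (nrows + d.nrptg - 1) / d.nrptg;
    return d;
}
```

Under these assumptions, a tensor with `ne0 = 32` and 1000 rows would need 125 threadgroups of 8 rows each instead of 1000 threadgroups that each use only 32 threads, which is the kind of occupancy gain the release describes.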
Alongside performance improvements, the release expands test coverage for copy operations (CPY). The test suite now includes 50 new reshaping test cases covering 1D-to-4D conversions, boundary conditions at 1024 elements, and small/large dimensionality changes. The tests are also refactored to use dimension permutations over {3, 5, 7, 32}, ensuring robustness across F32, F16, and Q4_0 data types. These changes benefit developers using llama.cpp for on-device AI on macOS, iOS, and other Metal-supported platforms.
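To give a sense of what "dimension permutations over {3, 5, 7, 32}" could look like, here is a rough sketch that enumerates every ordering of those four sizes. The helper names `dim_permutations` and `nelements` and the `Shape` alias are hypothetical, and this is not the actual llama.cpp test code; it only shows why such permutations are convenient for copy tests.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using Shape = std::array<int64_t, 4>;

// Enumerate every ordering of the four dimension sizes.
inline std::vector<Shape> dim_permutations(Shape dims = Shape{3, 5, 7, 32}) {
    // std::next_permutation visits all orderings only from a sorted start.
    std::sort(dims.begin(), dims.end());
    std::vector<Shape> out;
    do {
        out.push_back(dims);
    } while (std::next_permutation(dims.begin(), dims.end()));
    return out;
}

// Total element count of a 4D shape.
inline int64_t nelements(const Shape & s) {
    int64_t n = 1;
    for (int64_t d : s) {
        assert(d > 0);  // shapes in this sketch must have positive extents
        n *= d;
    }
    return n;
}
```

Because the four sizes are distinct, this yields 24 shapes, and every one holds the same 3 * 5 * 7 * 32 = 3360 elements, so a copy between any pair of them keeps the element count fixed while exercising a different memory layout.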
- Metal concat kernel optimized with row batching for tensors with ne0 < 256, improving GPU occupancy.
- Fixed the GGML_OP_SET kernel thread count to ensure correct parallel execution.
- Added 50 new test cases for CPY reshaping operations across multiple data types (F32, F16, Q4_0).
Why It Matters
Better GPU occupancy on narrow tensors means faster local LLM inference on Apple GPUs, and broader CPY test coverage makes the Metal backend more dependable for developers running llama.cpp on-device.