Developer Tools

b8197

The latest commit delivers roughly 4x faster matrix operations on Apple's AMX hardware.

Deep Dive

The open-source llama.cpp project, maintained by ggml-org, has landed a significant performance optimization in commit b8197 that specifically targets Apple Silicon's AMX (Apple Matrix Coprocessor) hardware. The change replaces OpenMP threading with a simpler std::thread implementation for AMX operations, resulting in dramatically faster inference for locally run large language models on Macs. The change adds a slight penalty at model-load time, but since a model is typically loaded once and then queried many times, the runtime gains make it a net win for most users.
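
The digest does not reproduce the diff, but the pattern it describes is a standard one: replace an OpenMP parallel-for with explicitly managed std::thread workers over row chunks. The sketch below illustrates that pattern only; the function names (pack_rows, parallel_pack), signatures, and chunking scheme are assumptions for illustration, not code from commit b8197.

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Hypothetical stand-in for the real AMX packing/conversion work.
    static void pack_rows(const float * src, float * dst,
                          size_t row_begin, size_t row_end, size_t cols) {
        for (size_t r = row_begin; r < row_end; ++r) {
            for (size_t c = 0; c < cols; ++c) {
                dst[r * cols + c] = src[r * cols + c];  // placeholder kernel body
            }
        }
    }

    // What was previously a "#pragma omp parallel for" over rows becomes
    // explicit workers, each handling a contiguous chunk of rows.
    static void parallel_pack(const float * src, float * dst,
                              size_t rows, size_t cols, size_t n_threads) {
        std::vector<std::thread> workers;
        workers.reserve(n_threads);
        const size_t chunk = (rows + n_threads - 1) / n_threads;
        for (size_t t = 0; t < n_threads; ++t) {
            const size_t begin = t * chunk;
            const size_t end   = begin + chunk < rows ? begin + chunk : rows;
            if (begin >= end) break;
            workers.emplace_back(pack_rows, src, dst, begin, end, cols);
        }
        for (auto & w : workers) {
            w.join();  // barrier: all chunks finished before returning
        }
    }

One plausible reading of the trade-off: spawning threads per call is cheaper to reason about but pays more setup cost than OpenMP's persistent thread team, which would be consistent with the slightly slower model loading the commit accepts in exchange for faster steady-state inference.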

The technical details reveal impressive benchmarks: the convert_B_packed_format() operation saw total execution time drop from 325.43ms to 78.97ms when OpenMP is disabled, a roughly 4x speedup. Individual matrix operations also improved, with one benchmark falling from 2.55ms to 1.55ms, roughly 1.6x faster (a 39% reduction in time). The commit was tested using the unsloth/gpt-oss-20b-GGUF:Q4_K_M model and signed off by Adrien Gallouët of Hugging Face, a sign of collaboration between major open-source AI players. The optimization is particularly significant as Apple Silicon becomes increasingly popular for local AI development, giving Mac users faster, more efficient LLM inference without requiring specialized GPU hardware.
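
As a rough illustration of how before/after numbers like these are collected, a wall-clock harness of the following shape suffices; time_ms() and the iteration count here are hypothetical, since the digest does not show the commit's actual benchmark code. (In ggml's CMake build, OpenMP is toggled with the GGML_OPENMP option, which matches the "when OpenMP is disabled" condition above.)

    #include <chrono>
    #include <cstdio>

    // Times `iters` calls of an operation and returns total milliseconds.
    // The lambda passed in stands in for the real kernel under test,
    // e.g. convert_B_packed_format(), whose signature the digest omits.
    template <typename F>
    static double time_ms(F && op, int iters) {
        const auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) {
            op();
        }
        const auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    int main() {
        const int iters = 100;
        const double total = time_ms([] { /* kernel call goes here */ }, iters);
        std::printf("total: %.2fms  per-call: %.4fms\n", total, total / iters);
        return 0;
    }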

Key Points
  • Commit b8197 removes the OpenMP dependency for Apple Silicon AMX, speeding up matrix operations by roughly 4x
  • Benchmarks show matrix operations dropping from 325ms to 79ms total execution time
  • Tested with the unsloth/gpt-oss-20b-GGUF:Q4_K_M model, with a slight model-loading slowdown as the trade-off

Why It Matters

Mac developers and users get significantly faster local LLM inference, making Apple Silicon more competitive for AI workloads.