llama.cpp b9208 boosts SYCL performance with oneMKL routing
New release speeds up small float32 matmuls on Intel GPUs and ships pre-built binaries for a wide range of platforms
The latest release of llama.cpp, version b9208, introduces a targeted performance improvement for its SYCL backend. SYCL is the Khronos Group's open programming model for heterogeneous computing, and Intel implements it in oneAPI. The change routes small float32 matrix multiplications directly to oneMKL (the oneAPI Math Kernel Library) instead of the default path through oneDNN, Intel's deep neural network library. Bypassing oneDNN reduces overhead for small matrices, which are common in intermediate transformer computations, and can yield measurable speedups during LLM inference on Intel discrete GPUs and integrated graphics. The patch was contributed by Chun Tao from Intel, with a signed-off commit (5511965).
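The routing idea can be illustrated with a minimal Python sketch. This is not the llama.cpp implementation: the function name `pick_backend`, the `MatmulShape` type, and the `SMALL_MATMUL_MAX_ELEMS` cutoff are all hypothetical, and the release notes do not say what size counts as "small". The sketch only shows the shape of the decision: check the data type and the matrix size, then choose a library.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the release notes do not state the real threshold,
# so this value is purely illustrative.
SMALL_MATMUL_MAX_ELEMS = 64 * 64


@dataclass(frozen=True)
class MatmulShape:
    m: int  # rows of the output matrix
    n: int  # columns of the output matrix
    k: int  # shared inner dimension


def pick_backend(shape: MatmulShape, dtype: str) -> str:
    """Return the name of the library a matmul would be sent to.

    Small float32 products go to oneMKL; everything else stays on the
    oneDNN path. Measuring "small" by output size is an assumption.
    """
    is_small = shape.m * shape.n <= SMALL_MATMUL_MAX_ELEMS
    if dtype == "f32" and is_small:
        return "oneMKL"
    return "oneDNN"


if __name__ == "__main__":
    # A small f32 product, the same shape in f16, and a large f32 product.
    print(pick_backend(MatmulShape(m=8, n=32, k=4096), "f32"))
    print(pick_backend(MatmulShape(m=8, n=32, k=4096), "f16"))
    print(pick_backend(MatmulShape(m=4096, n=4096, k=4096), "f32"))
```

The point of a size check like this is that a heavier library's setup cost matters most when the multiplication itself is tiny, so only those cases are redirected.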
Alongside the SYCL tweak, b9208 ships pre-compiled binaries for a wide range of platforms: macOS (Apple Silicon with optional KleidiAI acceleration, Intel x64, iOS XCFramework), Linux (x64, arm64, s390x, with Vulkan, ROCm 7.2, OpenVINO, SYCL FP32/FP16), Windows (x64 CPU, arm64 CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android arm64, and openEuler (x86 and aarch64). This broad coverage lets AI practitioners run LLMs such as Llama and Mistral locally on most modern hardware, with the new SYCL optimization benefiting Intel GPU users in particular.
- SYCL small f32 matmuls now route to oneMKL, bypassing oneDNN for lower latency
- Pre-built binaries available for macOS, Linux, Windows, Android, and openEuler
- Supports multiple backends: CUDA 12/13, Vulkan, ROCm 7.2, OpenVINO, SYCL FP32/FP16, HIP
Why It Matters
llama.cpp remains a go-to choice for local LLM inference; this update delivers a targeted speedup for small float32 matmuls on Intel GPUs.