Developer Tools

llama.cpp b9208 boosts SYCL performance with oneMKL routing

New release optimizes small matmul speed on Intel GPUs and more

Deep Dive

The latest release of llama.cpp, b9208, introduces a targeted performance improvement for the SYCL backend. (SYCL is the Khronos Group's open programming model for heterogeneous computing, which Intel implements in its oneAPI toolchain.) Specifically, the change routes small float32 matrix multiplications directly to the oneAPI Math Kernel Library (oneMKL) instead of oneDNN, the deep neural network library the backend uses by default. Bypassing oneDNN reduces overhead for small matrices, which are common in intermediate transformer computations, and can yield measurable speedups during LLM inference on Intel discrete GPUs and integrated graphics. The patch was contributed by Chun Tao from Intel, with a signed-off commit (5511965).
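To make the routing idea concrete, here is a minimal C++ sketch of a size-based dispatch. It is not the code from the commit: the names `GemmBackend`, `DType`, `kSmallMatmulLimit`, and `choose_gemm_backend` are hypothetical, and the cutoff value is an assumption chosen purely for illustration, since the release notes do not state the actual threshold or how "small" is measured.

```cpp
#include <cstdint>

// Hypothetical types for illustration; not taken from the llama.cpp source.
enum class GemmBackend { OneMKL, OneDNN };
enum class DType { F32, F16 };

// Assumed cutoff on m * n * k. The real threshold used in b9208 is not
// given in the release notes; this value is a placeholder.
constexpr std::int64_t kSmallMatmulLimit = 128LL * 128 * 128;

// Choose a backend for C[m x n] = A[m x k] * B[k x n].
// Small float32 products go straight to oneMKL; everything else
// stays on the default oneDNN path.
inline GemmBackend choose_gemm_backend(DType dtype,
                                       std::int64_t m,
                                       std::int64_t n,
                                       std::int64_t k) {
    const bool is_small = (m * n * k) <= kSmallMatmulLimit;
    if (dtype == DType::F32 && is_small) {
        return GemmBackend::OneMKL;
    }
    return GemmBackend::OneDNN;
}
```

Under these assumptions, a 32×32×32 float32 product would be sent to oneMKL, while a 4096×4096×4096 product, or any float16 product, would keep the oneDNN path. The point of such a check is that a fixed per-call setup cost matters most when the multiplication itself is cheap.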

Alongside the SYCL tweak, b9208 is published with pre-compiled binaries for a wide range of platforms: macOS (Apple Silicon with optional KleidiAI acceleration, Intel x64, iOS XCFramework), Linux (x64, arm64, s390x, with Vulkan, ROCm 7.2, OpenVINO, SYCL FP32/FP16), Windows (x64 CPU, arm64 CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android arm64, and openEuler (x86 and aarch64). This broad compatibility lets AI practitioners run LLMs such as Llama and Mistral locally on most modern hardware, with this release adding a specific optimization for Intel GPUs.

Key Points
  • SYCL small f32 matmuls now route to oneMKL, bypassing oneDNN for lower latency
  • Pre-built binaries available for macOS, Linux, Windows, Android, and openEuler
  • Supports multiple backends: CUDA 12/13, Vulkan, ROCm 7.2, OpenVINO, SYCL FP32/FP16, HIP

Why It Matters

llama.cpp remains a go-to option for local LLM inference; this update delivers a targeted speedup for small float32 matmuls on Intel GPUs.