b8937
New release re-enables fast GELU_QUICK_F16 for Apple Silicon and x64 CPUs
The llama.cpp project, a popular open-source C++ framework for running large language models locally, has released version b8937. This update re-enables the fast gelu_quick_f16 kernel for CPU inference, which had previously been disabled. GELU (Gaussian Error Linear Unit) is an activation function at the core of most transformer-based models, so this optimization speeds up a hot path of inference on supported hardware.
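For context, the "quick" GELU variant approximates the exact GELU with a single sigmoid: gelu_quick(x) = x · sigmoid(1.702 · x). The scalar C++ sketch below illustrates the math only; it is not the project's actual kernel, which operates on FP16 data (typically via a precomputed table or SIMD) rather than a naive per-element call like this.

```cpp
#include <cmath>
#include <cstdio>

// "Quick" GELU approximation: gelu_quick(x) = x * sigmoid(1.702 * x).
// Illustrative scalar sketch; the re-enabled kernel works on f16 values.
static float gelu_quick(float x) {
    return x / (1.0f + std::exp(-1.702f * x));
}

// Exact GELU for comparison: x * Phi(x), where Phi is the standard
// normal CDF written in terms of erf.
static float gelu_exact(float x) {
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

int main() {
    for (float x : {-2.0f, -0.5f, 0.0f, 0.5f, 2.0f}) {
        std::printf("x=%5.2f  quick=%8.5f  exact=%8.5f\n",
                    x, gelu_quick(x), gelu_exact(x));
    }
    return 0;
}
```

The approximation trades a small amount of accuracy for a much cheaper evaluation, which is why runtimes favor it for CPU inference.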
The release includes prebuilt binaries for a wide range of platforms: macOS (Apple Silicon and Intel), Linux (x64, ARM64, s390x), Windows (x64 and ARM64), and Android (ARM64). It also ships builds for multiple GPU backends, including CUDA 12 and 13, Vulkan, ROCm 7.2, OpenVINO, SYCL (FP32 and FP16), and HIP, so developers and enthusiasts can run models on most common hardware configurations without compiling from source. The release is signed with a GPG key so downloads can be verified.
- Re-enables fast gelu_quick_f16 kernel for CPU inference, improving activation function speed
- Supports macOS (Apple Silicon, Intel), Linux (x64, ARM64, s390x), Windows (x64, ARM64), and Android (ARM64)
- Includes GPU backends: CUDA 12/13, Vulkan, ROCm 7.2, OpenVINO, SYCL, and HIP
Why It Matters
Re-enabling the optimized kernel speeds up an activation function evaluated throughout every forward pass, improving local inference performance for Apple Silicon and x64 CPU users.