b8937
New release re-enables fast GELU_QUICK_F16 for Apple Silicon and x64 CPUs
The llama.cpp project, a popular open-source C++ framework for running large language models locally, has released version b8937. This update re-enables the fast gelu_quick_f16 kernel for CPU inference, which had previously been disabled. GELU (Gaussian Error Linear Unit) is an activation function at the core of most transformer-based models, so this optimization speeds up a hot path of inference on supported hardware.
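For context, the "quick" GELU variant approximates the exact GELU with a single sigmoid: gelu_quick(x) = x · sigmoid(1.702 · x). The scalar C++ sketch below illustrates the math only; it is not the project's actual kernel, which operates on FP16 data (typically via a precomputed table or SIMD) rather than a naive per-element call like this.

```cpp
#include <cmath>
#include <cstdio>

// "Quick" GELU approximation: gelu_quick(x) = x * sigmoid(1.702 * x).
// Illustrative scalar sketch; the re-enabled kernel works on f16 values.
static float gelu_quick(float x) {
    return x / (1.0f + std::exp(-1.702f * x));
}

// Exact GELU for comparison: x * Phi(x), where Phi is the standard
// normal CDF written in terms of erf.
static float gelu_exact(float x) {
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

int main() {
    for (float x : {-2.0f, -0.5f, 0.0f, 0.5f, 2.0f}) {
        std::printf("x=%5.2f  quick=%8.5f  exact=%8.5f\n",
                    x, gelu_quick(x), gelu_exact(x));
    }
    return 0;
}
```

The approximation trades a small amount of accuracy for a much cheaper evaluation, which is why runtimes favor it for CPU inference.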
The release includes prebuilt binaries for a wide range of platforms: macOS (Apple Silicon and Intel), Linux (x64, ARM64, s390x), Windows (x64 and ARM64), and Android (ARM64). It also ships builds for multiple GPU backends, including CUDA 12 and 13, Vulkan, ROCm 7.2, OpenVINO, SYCL (FP32 and FP16), and HIP, so developers and enthusiasts can run models on most common hardware configurations without compiling from source. The release is signed with a GPG key so downloads can be verified.
- Re-enables fast gelu_quick_f16 kernel for CPU inference, improving activation function speed
- Supports macOS (Apple Silicon, Intel), Linux (x64, ARM64, s390x), Windows (x64, ARM64), and Android (ARM64)
- Includes GPU backends: CUDA 12/13, Vulkan, ROCm 7.2, OpenVINO, SYCL, and HIP
Why It Matters
Re-enabling the optimized kernel speeds up an activation function evaluated throughout every forward pass, improving local inference performance for Apple Silicon and x64 CPU users.