b8935
Open-source LLM runner gets faster inference on mobile GPUs
The open-source llama.cpp project, which enables local execution of large language models, released version b8935 with a focus on GPU acceleration. The key addition is iq4_nl quantization support for OpenCL, a compute framework that runs on a wide range of GPUs. Specifically, the update includes optimized GEMM (general matrix multiply) and GEMV (general matrix-vector multiply) kernels for Adreno GPUs, commonly found in Qualcomm Snapdragon mobile chipsets. This lets models quantized with the 4-bit non-linear iq4_nl format run faster on Android devices and other OpenCL-capable hardware.
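For readers unfamiliar with the format, iq4_nl stores each weight as a 4-bit index into a small non-linear lookup table, scaled per block. The C sketch below illustrates that dequantization idea; the block size, struct layout, and table values are illustrative assumptions rather than the exact ggml definitions.

```c
#include <stdint.h>
#include <stdio.h>

// Illustrative non-linear lookup table: 16 signed values spaced more densely
// near zero, so small weights get finer resolution than a uniform 4-bit grid.
// (Assumed values for illustration; the actual ggml table may differ.)
static const int8_t iq4nl_lut[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113
};

#define BLOCK_SIZE 32  // assumed number of weights per quantization block

// Hypothetical block layout: one scale plus packed 4-bit indices
// (two indices per byte).
typedef struct {
    float   d;                    // per-block scale
    uint8_t qs[BLOCK_SIZE / 2];   // packed 4-bit lookup indices
} iq4nl_block;

// Dequantize one block: look up each 4-bit index in the table, then scale.
static void dequantize_block(const iq4nl_block *b, float *out) {
    for (int i = 0; i < BLOCK_SIZE / 2; ++i) {
        out[2 * i + 0] = b->d * iq4nl_lut[b->qs[i] & 0x0F];  // low nibble
        out[2 * i + 1] = b->d * iq4nl_lut[b->qs[i] >> 4];    // high nibble
    }
}

int main(void) {
    iq4nl_block b = { .d = 0.01f };
    for (int i = 0; i < BLOCK_SIZE / 2; ++i) {
        b.qs[i] = (uint8_t)((i << 4) | (15 - i));  // dummy packed indices
    }
    float w[BLOCK_SIZE];
    dequantize_block(&b, w);
    printf("w[0] = %f, w[1] = %f\n", w[0], w[1]);
    return 0;
}
```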
The release also packs two lookup-table entries into a single uint (32-bit unsigned integer) to improve memory efficiency during inference (sketched below). Builds are available across major platforms: macOS (Apple Silicon with optional KleidiAI acceleration, Intel x64, iOS XCFramework), Linux (x64/arm64/s390x CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL FP32/FP16), Windows (x64/arm64 CPU, CUDA 12/13, Vulkan, SYCL, HIP), Android (arm64 CPU), and openEuler (x86/aarch64 with Ascend 310p/910b). The release is signed with GitHub's verified GPG key for security.
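The release notes do not spell out the exact packing scheme for the lookup-table trick mentioned above. A common approach, shown in the C sketch below with hypothetical names, is to place two 16-bit entries in one 32-bit word so a single load fetches both.

```c
#include <stdint.h>
#include <stdio.h>

// Pack two 16-bit lookup-table entries into one 32-bit word so a single
// 32-bit load retrieves both values (hypothetical scheme for illustration).
static inline uint32_t pack2(uint16_t lo, uint16_t hi) {
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

// Unpack the two entries again with a mask and a shift.
static inline void unpack2(uint32_t packed, uint16_t *lo, uint16_t *hi) {
    *lo = (uint16_t)(packed & 0xFFFFu);
    *hi = (uint16_t)(packed >> 16);
}

int main(void) {
    uint32_t word = pack2(0x1234, 0xABCD);
    uint16_t a, b;
    unpack2(word, &a, &b);
    printf("packed = 0x%08X, lo = 0x%04X, hi = 0x%04X\n", word, a, b);
    return 0;
}
```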
- Adds iq4_nl quantization support for OpenCL, enabling faster LLM inference on GPUs
- Includes optimized GEMM/GEMV kernels specifically for Adreno GPUs (Qualcomm mobile chips)
- Packs 2 LUT entries into a single uint for improved memory efficiency during inference
Why It Matters
Makes running LLMs locally on Android devices and mobile GPUs faster and more memory-efficient.