Developer Tools

llama.cpp b9122 fixes WebGPU precision for multimodal models

The new release fixes precision bugs in the WebGPU backend that affect multimodal models.

Deep Dive

llama.cpp, the widely used C/C++ inference engine for large language models (110k GitHub stars), has released version b9122. The release focuses on precision fixes in the WebGPU backend, which matter when running multimodal models (text, images, and so on) on GPUs through WebGPU. Specific fixes include corrections to the GELU, GELU quick, and GELU erf functions, f32 precision for shared memory calculations, and the removal of hardcoded value types in the flash attention tile paths. The release also addresses NaN outputs by clamping in the GELU functions and by ensuring exponentials stay within a safe range for f32.
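To make the NaN failure mode concrete, here is a minimal sketch of how a tanh-based GELU can overflow in f32 and how a clamp avoids it. It is not the llama.cpp shader code: the helper names, the emulated f32 range check, and the clamp bound of 10 are all assumptions chosen for illustration, and the actual fix in b9122 may be structured differently.

```python
import math

F32_MAX = 3.4028234663852886e38  # largest finite float32 value
SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)


def exp_f32(x: float) -> float:
    """Emulate exp() in f32: anything past the float32 range becomes inf."""
    try:
        y = math.exp(x)
    except OverflowError:
        return math.inf
    return math.inf if y > F32_MAX else y


def tanh_via_exp(x: float) -> float:
    """tanh written as (e^(2x) - 1) / (e^(2x) + 1); inf / inf yields nan."""
    e = exp_f32(2.0 * x)
    return (e - 1.0) / (e + 1.0)


def gelu_tanh_naive(x: float) -> float:
    """Tanh-approximated GELU with no protection against exp overflow."""
    inner = SQRT_2_OVER_PI * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + tanh_via_exp(inner))


def gelu_tanh_clamped(x: float, limit: float = 10.0) -> float:
    """Same formula, but the tanh argument is clamped first.

    The bound `limit` is a hypothetical value for this sketch. tanh is
    already within about 1e-8 of +/-1 at |x| = 10, so clamping there
    barely changes the result while keeping exp() finite in f32.
    """
    inner = SQRT_2_OVER_PI * (x + 0.044715 * x ** 3)
    inner = max(-limit, min(limit, inner))
    return 0.5 * x * (1.0 + tanh_via_exp(inner))


if __name__ == "__main__":
    for x in (1.0, 11.0, -11.0):
        print(x, gelu_tanh_naive(x), gelu_tanh_clamped(x))
```

For an input of 11.0 the unclamped version evaluates exp of roughly 112, which exceeds the f32 range, so the division becomes inf / inf and the result is NaN. The clamped version returns a value very close to 11.0, which is what GELU should produce for large positive inputs.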

The release ships with builds for multiple platforms: Apple (macOS on Apple Silicon and Intel, iOS), Linux (x64, arm64, s390x, with Vulkan, ROCm, OpenVINO, SYCL), Windows (x64/arm64 with CPU, CUDA, Vulkan, SYCL, HIP), Android (arm64), and openEuler. This broad platform support makes the latest llama.cpp accessible for local LLM deployment across diverse hardware. Developers running multimodal models on WebGPU should upgrade to b9122 to avoid precision degradation.

Key Points
  • Fixes WebGPU precision issues for multimodal models, improving the accuracy of image and text inference
  • Corrects GELU, GELU quick, and GELU erf activation functions, with clamp to prevent NaN
  • Updates shared memory calculation logic for f32 mixed types and removes hardcoded value types in the flash attention tile path

Why It Matters

Local inference becomes more reliable for multimodal apps that run llama.cpp through the WebGPU backend, across diverse hardware.