Developer Tools

b8992

Run LLMs in the browser: 32-bit WASM builds can now handle models over 2GB...

Deep Dive

The llama.cpp project has shipped b8992, a maintenance release that significantly improves memory-mapped file handling for large language models. The headline change updates llama-mmap to use ftello/fseeko, enabling 32-bit WebAssembly targets to handle models exceeding 2GB. This is a critical fix for browser-based and other lightweight environments built on WASM: the standard ftell/fseek interfaces use a 32-bit long offset on those targets, so previous versions would fail on model files larger than 2GB.
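
The difference is easy to see in isolation. The sketch below is a minimal illustration of the pattern, not llama.cpp's actual code, and the model path is hypothetical: fseeko/ftello work in off_t, which is 64 bits when large-file support is enabled (e.g. via _FILE_OFFSET_BITS=64), so offsets past 2GB survive even on a 32-bit target where ftell's long would overflow.

```c
/* Minimal sketch (hypothetical, not llama.cpp's code): read a file's size
 * using 64-bit offsets on a 32-bit target. */
#define _FILE_OFFSET_BITS 64 /* must precede any #include to widen off_t */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void) {
    FILE * f = fopen("model.gguf", "rb"); /* hypothetical model path */
    if (!f) { perror("fopen"); return 1; }

    /* fseeko/ftello take and return off_t (64-bit here), so files larger
     * than 2GB are handled correctly; fseek/ftell would truncate or fail. */
    if (fseeko(f, 0, SEEK_END) != 0) { perror("fseeko"); fclose(f); return 1; }
    off_t size = ftello(f);
    printf("model size: %" PRId64 " bytes\n", (int64_t) size);

    fclose(f);
    return 0;
}
```

Under a musl-based toolchain such as Emscripten's, off_t is 64-bit by default, so the #define is redundant there but harmless; the key point is that the ftello/fseeko interfaces make the wide offset type available at all.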

The release also refines the build system, updating it to the newer gguf.cpp style, and adds KleidiAI support on Apple Silicon macOS/iOS. Prebuilt binaries cover a wide range of platforms: macOS (Apple Silicon with optional Kleidi, Intel, iOS), Linux (CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL), Android (arm64), Windows (CPU, CUDA 12 & 13, Vulkan, SYCL, HIP), and openEuler. This breadth means developers can run llama.cpp with hardware-appropriate acceleration on nearly any device, from edge servers to gaming PCs to mobile phones.

Key Points
  • Updated llama-mmap to use ftello/fseeko for proper large file support on 32-bit systems
  • Now supports 32-bit WebAssembly (WASM) with models >2GB for browser-based LLM inference
  • Adds KleidiAI acceleration for Apple Silicon and expands build targets to 30+ platform configs

Why It Matters

Enables running larger local LLMs in browsers and resource-constrained environments, expanding on-device AI possibilities.