b8751
The latest release makes Google's Gemma 2 models leaner to load by treating shared-KV tail attention tensors as optional, and adds new hardware backends.
The open-source community behind llama.cpp, the high-performance inference engine for running LLMs locally, has shipped a significant new release tagged b8751. The headline feature is a targeted optimization for Google's recently released Gemma 2 family of models: the model loader now treats the shared Key-Value (KV) tail attention tensors as optional during loading. This can reduce memory overhead and potentially shorten initialization times for users running Gemma 2, making the models more accessible on consumer hardware.
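To make the mechanism concrete, here is a minimal sketch of the "optional tensor" loading pattern. This is not llama.cpp's actual loader code; the names (Tensor, TensorMap, get_tensor, TENSOR_NOT_REQUIRED) are illustrative assumptions, but the shape of the logic matches the change described: a tensor flagged as optional yields a null result when missing, instead of a hard load failure.

```cpp
// Minimal sketch of an "optional tensor" loading pattern, not llama.cpp's
// actual loader. All names here are hypothetical illustrations.
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

struct Tensor {
    std::string          name;
    std::vector<int64_t> shape;
};

using TensorMap = std::map<std::string, Tensor>;

enum LoadFlags : uint32_t {
    TENSOR_REQUIRED     = 0,
    TENSOR_NOT_REQUIRED = 1u << 0, // absence is tolerated, not fatal
};

// Look up a tensor in the model file's tensor table. Historically a missing
// tensor would be a hard error; with TENSOR_NOT_REQUIRED the caller can treat
// absence as "this checkpoint doesn't use the feature" and get null instead.
const Tensor * get_tensor(const TensorMap & tensors,
                          const std::string & name,
                          uint32_t flags) {
    auto it = tensors.find(name);
    if (it == tensors.end()) {
        if (flags & TENSOR_NOT_REQUIRED) {
            return nullptr; // e.g. Gemma 2's shared-KV tail attention tensors
        }
        throw std::runtime_error("missing required tensor: " + name);
    }
    return &it->second;
}
```

Downstream graph-building code can then branch on the null result and skip the shared-KV tail attention path for checkpoints that omit those tensors, which is where the memory and startup savings come from.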
Beyond the Gemma 2 tweak, the b8751 release significantly broadens the project's hardware compatibility. The build matrix now includes new packages for running inference on Linux systems using the Vulkan graphics API and Intel's OpenVINO toolkit, offering users more choices for acceleration. The release maintains comprehensive support across the ecosystem with updated builds for macOS (Apple Silicon and Intel), iOS, Windows (including CUDA 12/13, Vulkan, and SYCL), and various Linux distributions. This continuous expansion of supported backends is crucial for developers and researchers who need to deploy models across diverse hardware environments, from data center GPUs to edge devices and personal computers.
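For developers juggling these backends, note that the set of available accelerators is fixed at build time. As a hedged sketch, recent llama.cpp/ggml versions expose a device registry in ggml-backend.h (ggml_backend_dev_count and related functions) that lets a program list the devices its particular build supports; whether this exact API ships unchanged in b8751 is an assumption here.

```cpp
// Minimal sketch: print the compute devices (CPU, Vulkan, CUDA, ...) that a
// given llama.cpp build exposes through ggml's backend/device registry.
// Link against a llama.cpp/ggml build; assumes the ggml-backend.h registry
// API found in recent releases.
#include <cstdio>
#include "ggml-backend.h"

int main() {
    const size_t n_dev = ggml_backend_dev_count();
    for (size_t i = 0; i < n_dev; ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        std::printf("device %zu: %s (%s)\n",
                    i,
                    ggml_backend_dev_name(dev),
                    ggml_backend_dev_description(dev));
    }
    return 0;
}
```

A Linux build configured with the Vulkan backend enabled, for example, would list Vulkan devices alongside the CPU, while a CUDA build would list the GPUs instead.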
- Optimizes Google's Gemma 2 models by making shared-KV tail attention tensors optional on load, reducing memory overhead.
- Adds new build targets for Vulkan API and Intel OpenVINO support on Linux, expanding hardware acceleration options.
- Maintains wide platform support with updates for macOS, Windows (CUDA/Vulkan/SYCL), iOS, and various Linux backends (ROCm, CPU).
Why It Matters
Lowers the barrier to running state-of-the-art models like Gemma 2 locally, giving developers more hardware flexibility and efficiency.