llama.cpp b8138
Latest update expands hardware compatibility, enabling Llama models to run on AMD, Intel, and Nvidia GPUs.
The ggml-org team has released llama.cpp version b8138, a significant expansion in practical hardware compatibility for the widely used open-source inference engine. The release, signed by contributor Adrien Gallouët from Hugging Face, ships prebuilt support for multiple GPU backends, including Vulkan for AMD and Intel GPUs, SYCL for Intel hardware, and HIP for AMD's ROCm platform. These complement the existing CUDA builds, making the framework more versatile for running Llama models across diverse hardware configurations.
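Backend selection happens when the library is built, not in application code. As a rough sketch (the function names follow recent llama.h headers and may vary between versions; the model path and layer count are placeholders), loading a model with GPU offload looks the same regardless of which backend the binary was compiled against:

```cpp
#include "llama.h"

#include <cstdio>

int main(void) {
    // One-time init; activates whichever GPU backend (CUDA, Vulkan,
    // SYCL, or HIP) this particular build of the library contains.
    llama_backend_init();

    llama_model_params params = llama_model_default_params();
    params.n_gpu_layers = 99;  // offload as many layers as fit on the GPU

    // "model.gguf" is a placeholder path to any GGUF-format model.
    llama_model * model = llama_model_load_from_file("model.gguf", params);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // ... create a context and run inference here ...

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

Swapping GPU vendors then comes down to linking a different build of the same library, which is exactly what the per-backend release binaries provide.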
The release includes pre-built binaries for Windows (x64 with CUDA 12/13, Vulkan, SYCL, and HIP), Linux (Ubuntu builds with CPU, Vulkan, and ROCm 7.2 variants), macOS (Apple Silicon and Intel), and iOS. Under the hood, the cpp-httplib dependency was updated to version 0.34.0, improving the HTTP server capabilities that are crucial for API deployments. This multi-platform approach addresses one of the biggest challenges in AI deployment: hardware fragmentation.
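To illustrate what those server capabilities enable, here is a minimal client sketch using the same cpp-httplib header. It assumes a llama-server instance already running on localhost:8080 (the server's default port) and uses the OpenAI-compatible /v1/chat/completions route; these specifics are general llama-server behavior, not details from this release.

```cpp
#include "httplib.h"  // cpp-httplib single-header HTTP client

#include <iostream>

int main() {
    // Assumes llama-server is running locally on its default port.
    httplib::Client client("http://localhost:8080");

    // llama-server exposes an OpenAI-compatible chat completion route.
    const char * body = R"({
        "messages": [
            {"role": "user", "content": "Say hello in one sentence."}
        ]
    })";

    auto res = client.Post("/v1/chat/completions", body, "application/json");
    if (res && res->status == 200) {
        std::cout << res->body << std::endl;  // raw JSON response
    } else {
        std::cerr << "request failed" << std::endl;
        return 1;
    }
    return 0;
}
```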
For developers and enterprises, this means reduced vendor lock-in and greater flexibility in deployment strategy. Organizations can deploy Llama models on AMD data-center GPUs via ROCm, target Intel's accelerators through SYCL, or run on consumer-grade AMD graphics cards via Vulkan, all from the same codebase. The update is a step toward hardware-agnostic AI inference, with the potential to lower costs and increase adoption of open-source language models in production environments.
- Ships Vulkan, SYCL, and HIP GPU builds alongside existing CUDA support
- Provides pre-built binaries for Windows, Linux, macOS, and iOS across multiple architectures
- Updates cpp-httplib to v0.34.0 for improved HTTP server functionality
Why It Matters
Reduces hardware vendor lock-in, enabling cost-effective deployment of Llama models across AMD, Intel, and Nvidia ecosystems.