b8718
The latest update brings Vulkan, ROCm, and CUDA builds, with packages spanning Windows, Linux, and macOS.
The ggml-org team behind the widely used llama.cpp project has shipped version b8718, marking a significant infrastructure upgrade for running large language models locally. The release expands GPU acceleration options, with prebuilt binaries for the Vulkan, ROCm 7.2, CUDA 12.4, CUDA 13.1, OpenVINO, and SYCL backends spanning Windows, Linux, and macOS. Developers can now leverage more hardware for faster inference, whether they're using NVIDIA, AMD, or Intel graphics cards, or specialized AI accelerators like Huawei's Ascend chips via the included openEuler builds.
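Putting those backends to work is mostly a launch-time decision. The sketch below (a hedged example, not part of the release notes) starts a prebuilt llama-server binary with layers offloaded to whatever GPU backend the build targets; the -m, -ngl, and --port flags are standard llama.cpp options, while the binary location and model path are placeholders:

```python
import subprocess

# Hedged sketch: start a prebuilt llama-server with GPU offload.
# The binary location and model path are placeholders; -m, -ngl,
# and --port are standard llama.cpp CLI options.
subprocess.run([
    "./llama-server",
    "-m", "models/model.gguf",  # placeholder path to a GGUF model
    "-ngl", "99",               # offload as many layers as fit on the GPU
    "--port", "8080",           # the server's default port, made explicit
])
```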
Beyond hardware support, the update introduces a crucial server-side feature: the server now respects the 'ignore EOS' flag. This gives developers precise control over text generation, letting them keep a model generating past end-of-sequence tokens when building streaming applications or chatbots. The release also includes pre-built binaries for Apple Silicon (with optional KleidiAI acceleration), iOS frameworks, and various CPU-only builds, making deployment easier across the entire ecosystem, from servers to mobile devices.
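As a concrete illustration of that control, here is a minimal request sketch against a llama-server instance assumed to be listening on the default localhost:8080. The /completion endpoint and the ignore_eos and n_predict fields follow the llama.cpp server documentation; the prompt and token budget are placeholders:

```python
import json
import urllib.request

# Hedged sketch: ask the server to keep generating past end-of-sequence
# tokens. With ignore_eos set, n_predict becomes the effective stop.
payload = {
    "prompt": "Write a short story about a robot:",  # placeholder prompt
    "n_predict": 256,    # hard cap on generated tokens
    "ignore_eos": True,  # don't stop at the end-of-sequence token
}

req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["content"])
```

Note that once EOS is ignored, the n_predict cap is what actually ends generation, so it is worth setting explicitly.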
- Ships prebuilt Vulkan, ROCm 7.2, CUDA 12.4/13.1, OpenVINO, and SYCL GPU backend binaries for accelerated inference
- Server now respects the 'ignore EOS' flag, giving finer control over when generation stops
- Provides pre-built binaries for Windows, Linux, macOS (Apple Silicon/Intel), iOS, and openEuler with Ascend support
Why It Matters
Enables faster, cheaper local AI inference across more hardware, reducing dependency on cloud APIs for developers building LLM applications.