llama.cpp b9077
Run local LLMs with Google Cloud’s Vertex AI API compatibility out of the box.
The b9077 release of llama.cpp, the popular C++ inference engine for LLaMA and other large language models, brings a significant new capability: server support for a Vertex AI-compatible API. Developers can now expose a local llama.cpp server through an API that mimics Google Cloud Vertex AI’s interface, so existing applications built against Vertex AI can run locally without code changes. The update also includes safer handling of AIP_* environment variables and assorted fixes for Windows, macOS, and Linux builds.
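As a rough sketch of what “without code changes” could mean in practice, the snippet below builds a request in Vertex AI’s standard predict shape (`{"instances": [...], "parameters": {...}}`) aimed at a local server. The endpoint path, instance fields, and parameter names here are illustrative assumptions; the release notes do not specify the exact route or schema llama.cpp exposes.

```python
import json

# Hypothetical local route; the actual path served by llama.cpp's
# Vertex AI-compatible mode is an assumption, not taken from the release.
LOCAL_ENDPOINT = (
    "http://localhost:8080/v1/projects/local/locations/local/endpoints/llama:predict"
)

def build_predict_request(prompt: str, max_tokens: int = 128) -> str:
    """Build a Vertex AI-style predict body: instances plus parameters."""
    body = {
        "instances": [{"prompt": prompt}],
        "parameters": {"maxOutputTokens": max_tokens, "temperature": 0.7},
    }
    return json.dumps(body)

payload = build_predict_request("Explain KV caching in one sentence.")
print(payload)
```

From here the payload would be POSTed to `LOCAL_ENDPOINT` with any HTTP client, exactly as a Vertex AI client POSTs to a cloud endpoint; only the base URL changes.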
Beyond the Vertex AI integration, the release provides pre-built assets across an extensive range of platforms: macOS Apple Silicon (arm64, with and without KleidiAI), macOS Intel, iOS XCFramework, Linux on x64/arm64/s390x with CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL; Android arm64; Windows x64/arm64 CPU, CUDA 12/13, Vulkan, SYCL, and HIP; and openEuler on x86 and aarch64 with ACL Graph support. This broad coverage means developers can use the new API on virtually any hardware setup.
- New server mode supports Vertex AI-compatible API for easier integration with Google Cloud workflows
- Release includes builds for 20+ platform variants including Apple Silicon, CUDA 12/13, ROCm, Vulkan, and SYCL
- Fixes for Windows builds and safer handling of AIP_* environment variables (including AIP_MODE)
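For context on the AIP_* handling: Vertex AI’s custom-container serving contract passes configuration through variables such as `AIP_HTTP_PORT`, `AIP_PREDICT_ROUTE`, and `AIP_HEALTH_ROUTE`. Whether llama.cpp’s server reads these directly is an assumption here; the sketch below just shows how that contract could be mapped by hand onto real `llama-server` flags (the launch command is printed, not executed).

```shell
#!/bin/sh
# Standard Vertex AI custom-container variables (values are examples).
export AIP_HTTP_PORT=8080
export AIP_PREDICT_ROUTE=/predict
export AIP_HEALTH_ROUTE=/health

# Map the AIP_* contract onto llama-server's --host/--port options.
# Printed rather than run, since no model or binary is assumed present.
echo "llama-server --host 0.0.0.0 --port ${AIP_HTTP_PORT}"
```

Safer handling in this area matters because a container scheduled by Vertex AI sets these variables automatically, so a server that misreads them fails health checks at startup.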
Why It Matters
Enables hybrid AI workflows by allowing local llama.cpp models to plug into Vertex AI client tooling.