llama.cpp b9077
Run local LLMs with Google Cloud’s Vertex AI API compatibility out of the box.
The b9077 release of llama.cpp, the popular C++ inference engine for LLaMA and other large language models, brings a significant new capability: server support for a Vertex AI-compatible API. Developers can now expose a local llama.cpp server through an API that mimics Google Cloud Vertex AI’s interface, so existing applications built against Vertex AI can run locally without code changes. The update also includes safer handling of AIP_* environment variables and assorted fixes for Windows, macOS, and Linux builds.
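As a rough sketch of what “without code changes” could mean in practice, the snippet below builds a request in Vertex AI’s standard predict shape (`{"instances": [...], "parameters": {...}}`) aimed at a local server. The endpoint path, instance fields, and parameter names here are illustrative assumptions; the release notes do not specify the exact route or schema llama.cpp exposes.

```python
import json

# Hypothetical local route; the actual path served by llama.cpp's
# Vertex AI-compatible mode is an assumption, not taken from the release.
LOCAL_ENDPOINT = (
    "http://localhost:8080/v1/projects/local/locations/local/endpoints/llama:predict"
)

def build_predict_request(prompt: str, max_tokens: int = 128) -> str:
    """Build a Vertex AI-style predict body: instances plus parameters."""
    body = {
        "instances": [{"prompt": prompt}],
        "parameters": {"maxOutputTokens": max_tokens, "temperature": 0.7},
    }
    return json.dumps(body)

payload = build_predict_request("Explain KV caching in one sentence.")
print(payload)
```

From here the payload would be POSTed to `LOCAL_ENDPOINT` with any HTTP client, exactly as a Vertex AI client POSTs to a cloud endpoint; only the base URL changes.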
Beyond the Vertex AI integration, the release provides pre-built assets across an extensive range of platforms: macOS Apple Silicon (arm64, with and without KleidiAI), macOS Intel, iOS XCFramework, Linux on x64/arm64/s390x with CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL; Android arm64; Windows x64/arm64 CPU, CUDA 12/13, Vulkan, SYCL, and HIP; and openEuler on x86 and aarch64 with ACL Graph support. This broad coverage means developers can use the new API on virtually any hardware setup.
- New server mode supports Vertex AI-compatible API for easier integration with Google Cloud workflows
- Release includes builds for 20+ platform variants including Apple Silicon, CUDA 12/13, ROCm, Vulkan, and SYCL
- Fixes for Windows builds and safer handling of AIP_* environment variables (including AIP_MODE)
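For context on the AIP_* handling: Vertex AI’s custom-container serving contract passes configuration through variables such as `AIP_HTTP_PORT`, `AIP_PREDICT_ROUTE`, and `AIP_HEALTH_ROUTE`. Whether llama.cpp’s server reads these directly is an assumption here; the sketch below just shows how that contract could be mapped by hand onto real `llama-server` flags (the launch command is printed, not executed).

```shell
#!/bin/sh
# Standard Vertex AI custom-container variables (values are examples).
export AIP_HTTP_PORT=8080
export AIP_PREDICT_ROUTE=/predict
export AIP_HEALTH_ROUTE=/health

# Map the AIP_* contract onto llama-server's --host/--port options.
# Printed rather than run, since no model or binary is assumed present.
echo "llama-server --host 0.0.0.0 --port ${AIP_HTTP_PORT}"
```

Safer handling in this area matters because a container scheduled by Vertex AI sets these variables automatically, so a server that misreads them fails health checks at startup.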
Why It Matters
Enables hybrid AI workflows by allowing local llama.cpp models to plug into Vertex AI client tooling.