Developer Tools

llama.cpp b9161 adds Codex CLI support and tool handling improvements

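The popular open-source LLM engine now works with OpenAI's Codex CLI. When a request includes Responses API tools that llama.cpp does not support, it skips them instead of rejecting the request.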

Deep Dive

The ggml-org team has released version b9161 of llama.cpp, the C/C++ inference engine for LLaMA and other large language models, optimized for running locally. This release focuses on compatibility with Codex CLI, OpenAI's command-line coding agent. The key change concerns requests that include Responses API tools llama.cpp does not support. llama.cpp now skips those tools and logs a warning instead of failing the whole request. Codex CLI may send tool definitions that a local server cannot handle, so this change lets Codex sessions continue against llama.cpp instead of stopping at the first unsupported tool. The release also reverts earlier special-case handling of the apply_patch tool for gpt-oss models, which suggests a move toward treating that tool like any other.
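The following minimal Python sketch illustrates the skip-and-warn behavior. It is not llama.cpp's actual C++ implementation. The function `filter_responses_tools`, the `SUPPORTED_TOOL_TYPES` set, and the example tool entries are hypothetical, and the sketch assumes that only tools whose `type` is `"function"` count as supported.

```python
import logging
from typing import Any

logger = logging.getLogger("responses_tools")

# Hypothetical assumption: only function-style tools are supported in this sketch.
SUPPORTED_TOOL_TYPES = {"function"}


def filter_responses_tools(tools: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return the supported tools; log a warning for each one that is dropped."""
    kept = []
    for tool in tools:
        tool_type = tool.get("type", "<missing>")
        if tool_type in SUPPORTED_TOOL_TYPES:
            kept.append(tool)
        else:
            # Skip instead of raising, so the rest of the request can proceed.
            logger.warning("skipping unsupported Responses tool of type %r", tool_type)
    return kept


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    # Illustrative tool list; not necessarily what Codex CLI actually sends.
    request_tools = [
        {"type": "function", "name": "shell",
         "parameters": {"type": "object", "properties": {}}},
        {"type": "web_search"},
        {"type": "custom", "name": "apply_patch"},
    ]
    usable = filter_responses_tools(request_tools)
    print([t["name"] for t in usable])  # prints ['shell']
```

Running the script logs one warning for each of the two non-function entries and prints `['shell']`. The design trades strictness for availability. The model never sees tools the server cannot handle, and the request still succeeds.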

The release ships with a large set of pre-built binaries that covers nearly every platform developers are likely to use. Apple users get separate builds for Apple Silicon (with and without KleidiAI optimizations) and Intel x64, plus an iOS XCFramework. Linux users can choose from CPU-only builds for x64, arm64, and s390x. GPU-accelerated Linux builds are available for Vulkan (x64 and arm64), ROCm 7.2, OpenVINO, and SYCL (FP32/FP16). Windows builds cover CPU (x64 and arm64), CUDA (with CUDA 12.4 and 13.1 runtime DLLs), Vulkan, SYCL, and HIP. Builds for Android arm64 and for openEuler targeting specific Huawei chipsets are also included. With 110K stars on GitHub, llama.cpp remains one of the most actively maintained open-source LLM projects.

Key Points
  • Version b9161 adds support for Codex CLI by skipping unsupported Responses API tools and logging a warning instead of failing the request.
  • Reverts the special-case apply_patch handling for gpt-oss, aligning it with standard tool handling.
  • Pre-built binaries cover Apple Silicon and Intel Macs, iOS, Linux, Windows, Android, and openEuler. GPU backends include CUDA 12/13, Vulkan, ROCm, HIP, OpenVINO, and SYCL.

Why It Matters

With Codex CLI support, developers can run an AI coding agent against a model served locally by llama.cpp. This could bring local LLM inference to more developer workflows and CI pipelines.