Developer Tools

llama.cpp b9412 released with 3600s server timeout and new builds

llama.cpp's latest release extends server timeout to one hour and expands platform support.

Deep Dive

ggml-org has tagged llama.cpp version b9412, a maintenance release focused on server stability and broader platform availability. The most notable change raises the server timeout to 3600 seconds (one hour), up from a lower default. Inference jobs that take many minutes can now finish without the server terminating the connection prematurely, which matters for batch processing and for interactive sessions with long context windows.
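A longer server-side timeout only helps if the client is willing to wait as long. The sketch below shows one way a Python client might match the one-hour window using only the standard library. It assumes a llama.cpp server running locally on port 8080 and uses its /completion endpoint; the URL, the constant names, and both helper functions are illustrative choices, not part of the release.

```python
import json
import urllib.request

# Assumed for illustration: a local llama.cpp server on port 8080.
SERVER_URL = "http://127.0.0.1:8080/completion"
# Client-side wait, chosen to match the one-hour server timeout reported for b9412.
CLIENT_TIMEOUT_S = 3600


def build_completion_request(prompt: str, n_predict: int = 256) -> urllib.request.Request:
    """Build a JSON POST request for the /completion endpoint (hypothetical helper)."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        SERVER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, n_predict: int = 256) -> str:
    """Send the request and return the generated text (hypothetical helper).

    urllib applies the timeout to each blocking socket operation, so a
    non-streaming reply that takes a long time to arrive is not cut off
    by the client before CLIENT_TIMEOUT_S seconds have passed.
    """
    request = build_completion_request(prompt, n_predict)
    with urllib.request.urlopen(request, timeout=CLIENT_TIMEOUT_S) as response:
        payload = json.load(response)
    return payload["content"]
```

With a server running, calling complete("Summarize the following document: ...") would block until the reply arrives or the client timeout elapses. Any reverse proxy sitting between client and server typically enforces its own timeout and would need the same adjustment.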

Beyond the timeout change, this release ships prebuilt binaries for most major platforms and hardware backends. Builds include macOS for both Apple Silicon and Intel; Linux for CPU (x64, arm64, s390x) as well as other backends (Vulkan, ROCm 7.2, OpenVINO, SYCL); Windows for CPU, arm64, CUDA 12/13, Vulkan, and HIP; plus Android arm64 and an iOS XCFramework. With these artifacts, most users no longer need to compile from source, which lowers the barrier to deploying llama.cpp in diverse environments.

Key Points
  • Server timeout raised to 3600 seconds (one hour), up from a lower default, to accommodate long-running inference.
  • Prebuilt binaries released for macOS, Linux, Windows, Android, and iOS across multiple CPU and GPU backends.
  • Eliminates source compilation for most users, simplifying installation and reducing setup friction.

Why It Matters

A longer server timeout keeps long-running requests from being cut off, and broad prebuilt platform support makes local LLM deployment easier to set up.