llama.cpp b8784
The latest update lets a local llama.cpp server transcribe audio files through the same API as OpenAI's hosted Whisper service.
The open-source project llama.cpp, maintained by the ggml-org team, has released a significant update with commit b8784. The headline feature is server support for the OpenAI-compatible `/v1/audio/transcriptions` API. Developers can now deploy a local inference server that accepts audio files and returns transcriptions using the same API specification as OpenAI's hosted Whisper service, so applications built against OpenAI's API can switch over without code changes while gaining the privacy, cost, and latency benefits of running models locally on consumer hardware.
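Because the endpoint follows the OpenAI specification, the official OpenAI Python SDK can talk to the local server just by overriding its base URL. The following is a minimal sketch, assuming a llama-server instance listening on localhost:8080 with an audio-capable model already loaded; the model name and API key are placeholders, since a default llama-server install does not validate keys.

```python
# Minimal sketch: point the OpenAI SDK at a local llama.cpp server.
# Assumptions: server at localhost:8080, audio-capable model loaded,
# no API key enforcement (the key string below is a placeholder).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # local llama-server instead of api.openai.com
    api_key="sk-no-key-required",         # placeholder; ignored unless --api-key is set
)

with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper",  # placeholder name; the server transcribes with its loaded model
        file=audio_file,
    )

print(transcript.text)
```

The only change relative to code written for OpenAI's cloud service is the `base_url`, which is the point of an OpenAI-compatible endpoint.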
Alongside this API addition, the release ships a comprehensive set of pre-built binaries across platforms and hardware backends. For Apple users, there are builds for macOS on both Apple Silicon (arm64) and Intel (x64), including a KleidiAI-enabled variant for enhanced performance. Linux users get standard CPU inference plus accelerated builds using Vulkan, ROCm 7.2 for AMD GPUs, and OpenVINO for Intel hardware. Windows support is equally broad, covering CPU, CUDA 12/13 for NVIDIA GPUs, Vulkan, and experimental SYCL and HIP backends. This coverage lets developers and researchers run local audio transcription across their entire device ecosystem.
- Adds OpenAI-compatible `/v1/audio/transcriptions` API endpoint to llama.cpp server, enabling drop-in replacement for Whisper API
- Provides pre-built binaries for macOS (Apple Silicon/Intel), Linux (CPU/Vulkan/ROCm/OpenVINO), and Windows (CPU/CUDA/Vulkan/SYCL)
- Enables private, low-latency audio processing on local hardware using open models such as Whisper (see the raw-request sketch below)
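For readers who prefer to see the wire format, the endpoint accepts a standard multipart/form-data POST, the same shape OpenAI documents for `/v1/audio/transcriptions`. Below is a minimal sketch using the requests library; the port is the llama-server default, and the `model` value is a placeholder the local server may ignore.

```python
# Minimal sketch of the raw HTTP contract, assuming a local llama-server on port 8080.
# Field names ("file", "model", "response_format") follow the OpenAI audio
# transcription spec; server-side defaults may vary.
import requests

with open("meeting.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8080/v1/audio/transcriptions",
        files={"file": ("meeting.wav", f, "audio/wav")},
        data={
            "model": "whisper",        # placeholder; a local server may ignore this
            "response_format": "json",  # "text" and others also exist in the OpenAI spec
        },
    )

resp.raise_for_status()
print(resp.json()["text"])  # JSON responses carry the transcript in the "text" field
```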
Why It Matters
Developers can now build audio AI features with OpenAI's API standard while keeping data private and reducing cloud costs.