llama.cpp commit b8766
The latest commit enables local transcription and audio understanding for Google's Gemma 4 models.
The open-source project llama.cpp, maintained by ggml-org, has released a significant update (commit b8766) that brings native audio processing to Google's Gemma 4 models. The change adds a USM-style Conformer encoder for the Gemma 4 E2B and E4B model variants, enabling local audio transcription and understanding. The encoder is a 12-layer Conformer with full self-attention, sinusoidal relative positional embeddings (RPE), and a 24-position sliding window mask. A dedicated audio preprocessor (mtmd_audio_preprocessor_gemma4a) computes 128 mel bins on the HTK mel scale and matches PyTorch's reference implementation to a cosine similarity of 0.9998.
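To make the preprocessing details concrete, here is a minimal, illustrative sketch of the two pieces named above: the HTK mel scale (the standard formula, 2595·log10(1 + f/700)) used to place the 128 filterbank bins, and the cosine-similarity metric by which the commit reports its 0.9998 match against PyTorch. This is not llama.cpp's code; the sample rate implied by the 8 kHz upper edge and the exact bin placement are assumptions for illustration.

```python
import math

def hz_to_mel_htk(f_hz):
    # HTK mel scale: mel = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz_htk(mel):
    # Inverse of the HTK mel formula.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_edges(n_mels=128, f_min=0.0, f_max=8000.0):
    # n_mels triangular filters need n_mels + 2 edge frequencies,
    # evenly spaced on the mel scale and mapped back to Hz.
    # f_max=8000.0 is an assumed Nyquist edge, not from the commit.
    lo, hi = hz_to_mel_htk(f_min), hz_to_mel_htk(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz_htk(lo + i * step) for i in range(n_mels + 2)]

def cosine_similarity(a, b):
    # The metric the commit reports (0.9998) between llama.cpp's
    # mel features and the PyTorch reference features.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

A cosine similarity of 1.0 means the two feature vectors point in exactly the same direction; 0.9998 indicates near-identical spectrogram output up to small numerical differences.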
The update represents a major step toward fully local multimodal AI, allowing developers to run audio-to-text transcription without cloud dependencies. The implementation has been tested across multiple backends including CPU, Vulkan, CUDA, and ROCm, and is available for all major platforms: macOS (Apple Silicon and Intel), Windows (x64 and arm64), Linux distributions, and even iOS via XCFramework. Key technical fixes include a std::set guard that prevents tensors from being loaded twice and an adjustment that aligns the sliding window mask with PyTorch's context_size.
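The two fixes above can be sketched in a few lines. The mask sketch assumes a symmetric window in which position i may attend to position j when |i − j| < context_size; the commit only states a 24-position window aligned with PyTorch's context_size, so the exact convention (symmetric vs. causal) is an assumption here. The dedup sketch mirrors the described std::set guard in Python.

```python
def sliding_window_mask(seq_len, context_size=24):
    # True = "may attend". Assumed symmetric convention:
    # position i attends to j iff |i - j| < context_size.
    return [[abs(i - j) < context_size for j in range(seq_len)]
            for i in range(seq_len)]

def load_tensors_once(tensor_names):
    # Analogue of the std::set guard: skip tensor names that
    # have already been loaded, preserving first-seen order.
    seen, loaded = set(), []
    for name in tensor_names:
        if name not in seen:
            seen.add(name)
            loaded.append(name)
    return loaded
```

An off-by-one in the window boundary (e.g. <= instead of <) would silently change which positions attend to each other, which is exactly the kind of mismatch the context_size alignment fix addresses.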
This integration means that applications built on llama.cpp can now process audio inputs through Gemma 4 models locally, opening possibilities for privacy-preserving transcription tools, offline voice assistants, and edge AI applications. The commit specifically enables transcription matching ground truth for test phrases like 'Glad to see things are going well and business is starting to pick up,' demonstrating production-ready accuracy.
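One common way to quantify "transcription matching ground truth" is word error rate (WER), the word-level edit distance between hypothesis and reference; the commit itself only reports an exact match on the test phrase, so using WER here is this article's illustration, not the project's test harness.

```python
def word_error_rate(reference, hypothesis):
    # Levenshtein distance over words, normalized by reference length.
    # 0.0 means the transcription matches the ground truth exactly.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions to reach empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions from empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[-1][-1] / max(len(ref), 1)

truth = "Glad to see things are going well and business is starting to pick up"
print(word_error_rate(truth, truth))  # exact match -> 0.0
```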
- Adds a Gemma 4 audio Conformer encoder for the E2B/E4B models: a 12-layer architecture with a 1024→1536 output projection followed by RMSNorm
- Enables local audio transcription across CPU/GPU backends (Vulkan, CUDA, ROCm) on Windows, macOS, Linux, and iOS
- Includes key fixes for tensor loading deduplication and proper sliding window mask alignment with PyTorch reference
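The "1024→1536→RMSNorm" shorthand in the first bullet suggests a linear projection from 1024 to 1536 dimensions followed by RMS normalization. The sketch below shows that pipeline shape with standard definitions; the ordering, epsilon value, and whether the norm carries a learned scale are assumptions, not details taken from the commit.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: divide by the root-mean-square of the vector
    # (no mean subtraction, unlike LayerNorm), then apply a
    # learned per-channel scale.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def linear(x, w):
    # Naive matrix-vector product; w has shape (out_dim, in_dim),
    # e.g. (1536, 1024) for the projection described above.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def encoder_output(x, w_proj, gamma):
    # Assumed ordering from the shorthand: project, then normalize.
    return rms_norm(linear(x, w_proj), gamma)
```

The projection widens the encoder's 1024-dim hidden states to the 1536-dim embedding space the language model consumes, and RMSNorm keeps the scale of those embeddings stable.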
Why It Matters
Enables fully local, privacy-preserving audio AI applications without cloud dependencies, expanding edge computing capabilities.