Developer Tools

b9045

Run IBM's Granite Speech 4.0-1B model locally with llama.cpp's new release.

Deep Dive

ggml-org released llama.cpp b9045, adding native support for IBM's Granite Speech model (granite-4.0-1b-speech) and bringing state-of-the-art speech AI to local, offline environments with no cloud dependency. The model uses a Conformer encoder with Shaw relative position encoding, GLU gating, folded batch norm, and SSM depthwise convolution. A QFormer projector compresses the encoder output into the LLM embedding space via windowed cross-attention (window=15, queries=3).

Audio preprocessing converts raw waveforms to log-mel spectrograms: a reflect-padded STFT, an 80-bin mel filterbank, dynamic range compression, and 2× frame stacking (80→160 mel features per frame). The GGUF converter handles batch norm folding, fused K/V split, and Conv1d weight reshaping at export time.
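The preprocessing chain above can be sketched in NumPy. This is a minimal illustration of the named steps (reflect-padded STFT → 80-bin mel filterbank → log compression → 2× frame stacking); the FFT size, hop length, and filterbank construction here are illustrative assumptions, not the model's exact configuration:

```python
import numpy as np

def log_mel_features(wav: np.ndarray, sr: int = 16000,
                     n_fft: int = 512, hop: int = 160, n_mels: int = 80) -> np.ndarray:
    """Toy log-mel pipeline: reflect-padded STFT -> mel filterbank ->
    dynamic range compression -> 2x frame stacking (80 -> 160 features)."""
    # Reflect-pad so analysis frames are centered on the signal.
    pad = n_fft // 2
    x = np.pad(wav, (pad, pad), mode="reflect")

    # Framed, windowed power spectrogram.
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (frames, n_fft//2+1)

    # Triangular mel filterbank (HTK-style mel scale).
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel = power @ fb.T                                      # (frames, 80)

    # Dynamic range compression: log with a small floor.
    logmel = np.log(np.clip(mel, 1e-10, None))

    # 2x frame stacking: concatenate adjacent frames -> 160 features each.
    if logmel.shape[0] % 2:
        logmel = logmel[:-1]
    return logmel.reshape(-1, 2 * n_mels)                   # (frames//2, 160)
```

For one second of 16 kHz audio this yields a (50, 160) feature matrix, which is what the Conformer encoder would then consume.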

The release has been validated against the HuggingFace transformers reference implementation, achieving token-for-token match on 30-second and 60-second audio clips with greedy decoding. Builds are available for macOS (Apple Silicon arm64, Intel x64), Linux (x64, arm64, s390x, with Vulkan, ROCm, OpenVINO, SYCL support), Windows (x64 CPU/GPU with CUDA, Vulkan, HIP), and iOS. This expands llama.cpp's multimodal capabilities, allowing developers and hobbyists to run Granite Speech entirely on local hardware for applications like voice assistants, transcription, and audio understanding.
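The batch norm folding the GGUF converter performs at export time is the standard trick of absorbing BN statistics into the preceding layer's weights, so no normalization op remains at inference. A minimal sketch for a linear layer (the per-output-channel math is the same for Conv1d; names here are illustrative):

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm into the preceding weights:
    y = gamma * (W@x + b - mean) / sqrt(var + eps) + beta
    becomes y = W'@x + b', eliminating the runtime BN op."""
    scale = gamma / np.sqrt(var + eps)
    w_folded = w * scale[:, None]          # scale each output channel's weights
    b_folded = (b - mean) * scale + beta   # fold mean/shift into the bias
    return w_folded, b_folded
```

The folded weights produce bit-for-bit the same affine map as conv-then-BN, which is why this can be done once at conversion time with no accuracy cost.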

Key Points
  • Adds Granite Speech 4.0-1B model with Conformer encoder, QFormer projector, and log-mel spectrogram preprocessing.
  • Audio pipeline includes reflect-padded STFT, 80-bin mel filterbank, dynamic range compression, and 2× frame stacking.
  • Tested token-for-token with HuggingFace reference; supports CPU, Apple Silicon, and GPU (CUDA, Vulkan, ROCm).
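The QFormer compression mentioned above amounts to windowed attention pooling: every block of 15 encoder frames is summarized by 3 query vectors, a 5× temporal downsampling. A toy NumPy illustration with random weights (the real projector uses learned queries and full cross-attention layers):

```python
import numpy as np

def window_compress(enc: np.ndarray, window: int = 15, n_queries: int = 3,
                    rng=np.random.default_rng(0)) -> np.ndarray:
    """Windowed cross-attention pooling: (T, d) encoder frames ->
    (ceil(T/window) * n_queries, d) projected tokens."""
    T, d = enc.shape
    pad = (-T) % window                      # pad so T divides into windows
    enc = np.pad(enc, ((0, pad), (0, 0)))
    blocks = enc.reshape(-1, window, d)      # (n_blocks, 15, d)
    queries = rng.standard_normal((n_queries, d))  # stand-in for learned queries
    out = []
    for blk in blocks:
        scores = queries @ blk.T / np.sqrt(d)      # (3, 15) attention logits
        attn = np.exp(scores - scores.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)    # softmax over the window
        out.append(attn @ blk)                     # (3, d) pooled summary
    return np.concatenate(out)
```

With window=15 and queries=3, 150 encoder frames become 30 embedding-space tokens, which keeps long audio clips affordable for the LLM's context window.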

Why It Matters

Enables on-device speech AI with Granite Speech, improving privacy and reducing cloud reliance for developers building voice assistants, transcription, and audio-understanding tools.