Open Source

NVIDIA Parakeet ported to ggml: up to 5x faster, no Python

Byte-for-byte identical to NeMo, up to 5x faster on GPU.

Deep Dive

A developer known as mudler_it has ported NVIDIA's Parakeet speech-to-text models to pure C++ on the ggml runtime (the engine behind llama.cpp and whisper.cpp). The resulting library, parakeet.cpp, supports FastConformer architectures including TDT, CTC, RNNT, and hybrid models. It eliminates all Python and PyTorch dependencies and runs on CPU, or on GPU via CUDA, HIP, Vulkan, and Metal. On the f32/f16 path the output is byte-for-byte identical to that of NVIDIA's NeMo framework, i.e. a word error rate (WER) of 0 relative to NeMo's transcripts.
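
To make the "WER of 0" figure concrete: word error rate is the word-level edit distance between a hypothesis transcript and a reference transcript, divided by the number of reference words. The sketch below is a generic illustration of that definition, not code from parakeet.cpp or NeMo; the function name `wer` and the sample transcripts are invented for the example. With NeMo's transcript as the reference, identical output yields 0.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, for illustration only.
nemo_output = "the quick brown fox jumps over the lazy dog"
ggml_output = "the quick brown fox jumps over the lazy dog"
print(wer(nemo_output, ggml_output))  # identical transcripts
```

A single substituted word in a three-word reference would give 1/3, so a score of exactly 0 only occurs when every word matches.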

Benchmarks show up to a 5x speedup on GPU for the larger TDT/hybrid models compared to NeMo's PyTorch runtime, and up to 1.86x on CPU when using quantized GGUF files. Memory usage is halved. In a test on a 23-second audio clip, GPU throughput works out to the equivalent of one hour of audio in roughly six seconds, or about 600x realtime. The release includes four quantized variants (q8_0, q6_k, q5_k, q4_k), cache-aware streaming with real-time end-of-utterance detection, and word-level timestamps with confidence scores. A minimal C-API and self-contained GGUF files (with the tokenizer baked in) make it embeddable anywhere, and it is already available as a LocalAI backend behind the OpenAI-compatible /v1/audio/transcriptions endpoint.
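
The 600x figure follows directly from the numbers quoted above. Here is a small worked sketch of the arithmetic; the helper name `realtime_factor` is made up for illustration. Note that some sources define "realtime factor" the other way round (processing time divided by audio duration), so this sketch uses the speed-multiple sense the article uses.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute (speed multiple)."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return audio_seconds / processing_seconds


one_hour = 60 * 60                      # 3600 s of audio
speed = realtime_factor(one_hour, 6.0)  # "roughly six seconds" from the text
print(speed)  # 600.0

# Implied processing time for the 23-second benchmark clip at that rate.
clip_seconds = 23.0
print(clip_seconds / speed)
```

At that rate the 23-second clip itself would take a few hundredths of a second, which is why the result is quoted as an extrapolated hour rather than as a raw timing.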

Key Points
  • Matches NVIDIA NeMo output exactly (WER 0 relative to NeMo on the f32/f16 path) while running up to 5x faster on GPU and up to 1.86x faster on CPU with quantized models.
  • Supports all major Parakeet model variants (FastConformer TDT, CTC, RNNT, hybrid), with GGUF files ranging from f16 down to q4_k quantization.
  • Comes with cache-aware streaming, word-level timestamps, a small C-API, and ships as a LocalAI backend for local OpenAI-compatible speech recognition.
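
As a rough sketch of what calling the OpenAI-compatible endpoint mentioned above could look like, the snippet below builds a multipart POST to /v1/audio/transcriptions using only the Python standard library. It assumes the server follows the usual OpenAI request shape (form fields `model` and `file`, JSON reply with a `text` field); the helper names, the base URL, and the model name are placeholders, not values taken from the parakeet.cpp or LocalAI documentation.

```python
import json
import os
import urllib.request
import uuid


def build_transcription_request(base_url: str, model: str,
                                filename: str, audio: bytes) -> urllib.request.Request:
    """Build a multipart/form-data POST for an OpenAI-style transcription endpoint."""
    boundary = uuid.uuid4().hex
    parts = [
        # Text field naming the model; the value depends on the server's configuration.
        (f'--{boundary}\r\nContent-Disposition: form-data; name="model"'
         f'\r\n\r\n{model}\r\n').encode(),
        # File field header, followed by the raw audio bytes.
        (f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
         f'filename="{filename}"\r\nContent-Type: application/octet-stream'
         f'\r\n\r\n').encode(),
        audio,
        f'\r\n--{boundary}--\r\n'.encode(),
    ]
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(base_url: str, model: str, path: str) -> str:
    """Upload the audio file at `path` and return the "text" field of the JSON reply."""
    with open(path, "rb") as f:
        request = build_transcription_request(
            base_url, model, os.path.basename(path), f.read())
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]
```

A call might then look like `transcribe("http://localhost:8080", "parakeet", "clip.wav")`, where the host, port, and model name are assumptions that depend on how the local server is set up.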

Why It Matters

Enables high-performance, local, embeddable speech recognition with no Python or PyTorch stack, while matching the output of NVIDIA's NeMo reference implementation.