Open Source

audio.cpp: 12 audio models in one C++ runtime, up to 5x faster than Python

PocketTTS generates nearly 6 minutes of audio in about 7 seconds, roughly 48x real time

Deep Dive

audio.cpp is a native C++ inference framework for audio models, built on ggml. It currently supports 12 released models out of 25 model families. These cover TTS and voice cloning (Chatterbox, MioTTS, OmniVoice, PocketTTS, Qwen3-TTS, VoxCPM2), ASR, alignment, and VAD (Qwen3-ASR, Qwen3 Forced Aligner, Silero VAD), and voice conversion, codec, and editing (Seed-VC, MioCodec, and Vevo2; Vevo2 also handles TTS and singing generation). The project's core goal is to replace the usual patchwork of per-model Python environments, dependency trees, and deployment setups with a single runtime, session handling, CLI, and server. Python is used only for downloading and converting model packages; all inference paths are native C++.

Benchmarks on Ubuntu/CUDA with the original weights show speedups over Python: PocketTTS runs 3.68x faster in one-shot runs, Qwen3-TTS 2.74x faster in warm sessions, and Vevo2 5.03x faster in one-shot runs. Long-form throughput is higher still: PocketTTS generates 5 minutes 53 seconds of audio in 7.30 seconds (48.40x real time), and every released TTS family runs faster than real time, ranging from 4.34x to 48.40x. The developer considers the warm-session numbers the most relevant for production. A sample redubbing pipeline runs a 418-second recording through ASR and TTS in one CLI command. The framework is still early: backend coverage varies by model, and streaming is not yet generally supported, so current paths are offline only.
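The "x real time" figures above are simply the duration of the generated audio divided by the wall-clock time spent generating it. As a minimal sketch of that arithmetic (the function name `real_time_factor` is illustrative and not part of audio.cpp), the PocketTTS long-form numbers work out as follows. The article gives the audio length only to the nearest second, so the result lands slightly below the reported 48.40x:

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Return seconds of audio produced per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Figures quoted in the article: 5 min 53 s of audio generated in 7.30 s.
audio_seconds = 5 * 60 + 53  # 353 s, rounded to whole seconds in the article
wall_seconds = 7.30

rtf = real_time_factor(audio_seconds, wall_seconds)
print(f"{rtf:.2f}x real time")  # prints "48.36x real time"
```

Any value above 1.0 means the model produces audio faster than it would take to play back, which is the threshold all released TTS families are reported to clear.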

Key Points
  • audio.cpp runs 12 audio models including Qwen3-TTS, PocketTTS, Vevo2, and Seed-VC in a single C++ runtime built on ggml.
  • Performance gains: up to 5.03x faster than Python on CUDA, with PocketTTS achieving 48.40x real-time long-form generation.
  • A shared CLI and native C++ inference eliminate separate Python environments; Python is used only to download and convert models.

Why It Matters

audio.cpp unifies fragmented audio AI deployment under one C++ runtime, making it faster and simpler to integrate TTS and ASR into production.
