Open Source

audio.cpp: 12 audio models in one C++ runtime, up to 5x faster than Python

r/LocalLLaMA June 26, 2026

⚡ PocketTTS generates 5 minutes 53 seconds of audio in 7.30 seconds (48.40x real time)

Deep Dive

audio.cpp is a native C++ inference framework for audio models, built on ggml. It currently supports 12 released models out of 25 model families, spanning three groups: TTS and voice cloning (Chatterbox, MioTTS, OmniVoice, PocketTTS, Qwen3-TTS, VoxCPM2); ASR, alignment, and VAD (Qwen3-ASR, Qwen3 Forced Aligner, Silero VAD); and voice conversion, codec, and editing (Seed-VC, MioCodec, and Vevo2, which also handles TTS and singing generation). The project's core goal is to replace the fragmented mix of per-model Python environments, dependency trees, and deployment setups with one unified runtime, session handling, CLI, and server. Python is used only to download and convert model packages; every inference path is native C++.

Benchmarks on Ubuntu with CUDA, using the original weights, show clear speedups over Python: PocketTTS is 3.68x faster in 1-shot runs, Qwen3-TTS is 2.74x faster in warm sessions, and Vevo2 is 5.03x faster in 1-shot runs. Long-form throughput is higher still: PocketTTS generates 5 minutes 53 seconds of audio in 7.30 seconds (48.40x real time), and every released TTS family runs faster than real time (4.34x to 48.40x). The developer emphasizes warm-session numbers as the most relevant for production. A sample redubbing pipeline runs a 418-second recording through ASR and TTS in a single CLI command. The framework is still early: backend coverage varies by model, and streaming is not yet generally supported, so current paths are offline only.
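To make the "x real time" figures concrete, here is a minimal sketch of how a real-time factor can be computed from the numbers quoted above. The function names are illustrative and are not part of audio.cpp. The definition used (seconds of audio produced per second of wall-clock time) is an assumption about how the benchmark reports throughput.

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time (assumed definition)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def wall_time_for(audio_seconds: float, rtf: float) -> float:
    """Wall-clock seconds needed to generate audio_seconds at a given real-time factor."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


# PocketTTS long-form figure from the article: 5 min 53 s of audio in 7.30 s.
audio_seconds = 5 * 60 + 53  # 353 s, duration rounded to whole seconds
wall_seconds = 7.30

rtf = real_time_factor(audio_seconds, wall_seconds)
print(f"{rtf:.2f}x real time")

# At the slowest reported TTS figure (4.34x), one minute of audio would take:
print(f"{wall_time_for(60, 4.34):.1f} s")
```

With the rounded duration this gives about 48.4x. The small gap from the reported 48.40x is presumably because the audio duration is quoted to the nearest second, though the source does not say so.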

Key Points

audio.cpp runs 12 audio models including Qwen3-TTS, PocketTTS, Vevo2, and Seed-VC in a single C++ runtime built on ggml.
Performance gains: up to 5.03x faster than Python on CUDA, with PocketTTS achieving 48.40x real-time long-form generation.
A shared CLI and native C++ inference eliminate separate Python environments; Python is used only for model download and conversion.

Why It Matters

audio.cpp unifies fragmented audio AI deployment under one C++ runtime, making TTS and ASR faster and simpler to integrate into production.
