An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding
A novel AI speech model cuts latency to 49ms, making AI voices feel truly real-time.
A team of researchers including Tianhui Su, Tien-Ping Tan, and Salima Mdhaffar has unveiled a groundbreaking architecture for ultra-low latency speech synthesis. Published on arXiv, their work tackles the core bottleneck in real-time text-to-speech (TTS): the high computational cost of the neural vocoder used to reconstruct audio. Instead of the conventional pipeline of continuous acoustic features followed by vocoding, their non-autoregressive model generates speech directly in the highly compressed discrete latent space of the Mimi neural audio codec. This fundamental shift eliminates the slow, separate vocoder step entirely.
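To make the shift concrete, here is a minimal sketch of what generating directly in a codec's discrete latent space can look like: a non-autoregressive network emits one code index per RVQ layer per frame, and the codec's own lightweight decoder, rather than a neural vocoder, turns those codes into a waveform. All module names, dimensions, and the codebook size below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: predict discrete codec tokens in parallel, then hand
# them to the codec's decoder. Not the paper's actual code or API.
import torch
import torch.nn as nn

NUM_RVQ_LAYERS = 32   # per the article: 32 layers of residual VQ codes
CODEBOOK_SIZE = 2048  # assumption: a typical RVQ codebook size
HIDDEN = 512          # assumption: model width

class NonAutoregressiveCodePredictor(nn.Module):
    """Maps frame-level text states to one code index per RVQ layer per frame."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, batch_first=True),
            num_layers=4,
        )
        # One classification head per RVQ depth layer, all applied in parallel.
        self.heads = nn.ModuleList(
            [nn.Linear(HIDDEN, CODEBOOK_SIZE) for _ in range(NUM_RVQ_LAYERS)]
        )

    def forward(self, text_states: torch.Tensor) -> torch.Tensor:
        h = self.encoder(text_states)  # (batch, frames, HIDDEN)
        logits = torch.stack([head(h) for head in self.heads], dim=1)
        return logits.argmax(-1)       # (batch, NUM_RVQ_LAYERS, frames)

codes = NonAutoregressiveCodePredictor()(torch.randn(1, 100, HIDDEN))
print(codes.shape)  # torch.Size([1, 32, 100])
# audio = mimi.decode(codes)  # hypothetical call: the Mimi decoder, not a vocoder
```

Because every frame's codes come out in one parallel pass, the expensive waveform step reduces to a single run of the codec decoder.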
The architecture integrates a modified FastSpeech 2 backbone with a new 'progressive depth-wise sequential decoding' strategy. This method decodes the codec's 32 layers of residual vector quantization (RVQ) codes in sequence, conditioning each layer on the layers decoded before it, which resolves phonetic alignment issues and manages the complex layered audio representation without autoregressive overhead across time. The result is a system that is not only 10.6 times faster than traditional cascaded TTS pipelines but also produces higher-quality audio, with improved voicing accuracy and fewer spectral artifacts. Crucially, it achieves an average latency of just 48.99 milliseconds from request to first audio byte, demonstrably below the human threshold for perceiving delay in real-time interaction.
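The depth-wise idea can be sketched as a loop over quantizer depth rather than over time: layer k's prediction is conditioned on the embedded codes of the shallower layers, while every frame is still produced in one parallel pass. This is a hedged reconstruction from the strategy's name; the names and sizes below are assumptions.

```python
# Hypothetical sketch of depth-wise sequential decoding over 32 RVQ layers.
import torch
import torch.nn as nn

NUM_RVQ_LAYERS, CODEBOOK_SIZE, HIDDEN = 32, 2048, 512  # sizes assumed as above

class DepthwiseSequentialDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.code_emb = nn.ModuleList(
            [nn.Embedding(CODEBOOK_SIZE, HIDDEN) for _ in range(NUM_RVQ_LAYERS)]
        )
        self.heads = nn.ModuleList(
            [nn.Linear(HIDDEN, CODEBOOK_SIZE) for _ in range(NUM_RVQ_LAYERS)]
        )

    def forward(self, frame_states: torch.Tensor) -> torch.Tensor:
        # frame_states: (batch, frames, HIDDEN) from a FastSpeech 2-style backbone
        cond, layers = frame_states, []
        for k in range(NUM_RVQ_LAYERS):
            codes_k = self.heads[k](cond).argmax(-1)   # (batch, frames)
            layers.append(codes_k)
            # Condition the next, deeper layer on everything decoded so far.
            cond = cond + self.code_emb[k](codes_k)
        return torch.stack(layers, dim=1)              # (batch, 32, frames)

codes = DepthwiseSequentialDecoder()(torch.randn(1, 100, HIDDEN))
print(codes.shape)  # torch.Size([1, 32, 100])
```

The loop runs over the 32 quantizer layers, not over output frames, so the cost per frame is fixed and there is no token-by-token autoregression across time.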
Tested on both English and Malay datasets, the architecture shows strong language-agnostic performance, paving the way for globally deployable, real-time voice interfaces. This breakthrough moves us from 'fast' AI speech to 'instantaneous' AI speech, where the response feels immediate and natural, unlocking new paradigms for live translation, interactive assistants, and immersive gaming and metaverse experiences.
- Achieves an ultra-low 48.99 ms time-to-first-byte, below the human perception threshold for real-time interaction (see the block-wise streaming sketch after this list).
- Uses a novel depth-wise decoding strategy on the Mimi codec's latent space, making it 10.6x faster than conventional TTS pipelines.
- Demonstrates language-agnostic capability with tests on English and Malay, improving audio quality by reducing spectral over-smoothing.
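For the latency claim, a hedged sketch of the block-wise streaming pattern: audio is emitted as soon as the first small block of codec frames is decoded, so time-to-first-byte depends only on that first block rather than on the whole utterance. The block size and the stand-in decoder are assumptions for illustration.

```python
# Hypothetical block-wise streaming loop with a time-to-first-byte (TTFB) probe.
import time

BLOCK_FRAMES = 4  # assumed number of codec frames synthesized per block

def stream_tts(frames, decode_block):
    """Yield audio chunks block by block; report TTFB once the stream ends."""
    start = time.perf_counter()
    ttfb_ms = None
    for i in range(0, len(frames), BLOCK_FRAMES):
        chunk = decode_block(frames[i:i + BLOCK_FRAMES])  # decode one block
        if ttfb_ms is None:  # the first audio bytes are ready at this point
            ttfb_ms = (time.perf_counter() - start) * 1000.0
        yield chunk
    print(f"TTFB: {ttfb_ms:.2f} ms")

# Toy usage with a stand-in decoder that returns silent PCM per frame:
for chunk in stream_tts(list(range(20)), lambda block: b"\x00" * 1920 * len(block)):
    pass  # in a real system, each chunk would be pushed to the audio output
```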
Why It Matters
Enables truly real-time, natural conversations with AI assistants, live translators, and game characters by eliminating perceptible audio lag.