CTC-TTS: LLM-based dual-streaming text-to-speech with CTC alignment
A new dual-streaming TTS model replaces clunky forced-alignment tooling with a neural CTC aligner, cutting streaming latency by up to 50%.
A research team has introduced CTC-TTS, a novel Large Language Model (LLM)-based text-to-speech system designed specifically for efficient, high-quality dual-streaming synthesis. The core innovation addresses two major bottlenecks in current streaming TTS: cumbersome alignment pipelines and inefficient token interleaving.
Technically, CTC-TTS replaces traditional, pipeline-heavy GMM-HMM forced-alignment toolkits (such as the Montreal Forced Aligner, or MFA) with a more flexible, integrated Connectionist Temporal Classification (CTC) neural aligner, eliminating several preprocessing steps. It also abandons fixed-ratio interleaving of text and speech tokens in favor of a novel 'bi-word' strategy, which better captures the natural alignment regularities between text and audio. The team designed two model variants: CTC-TTS-L (higher quality via token concatenation along the sequence length) and CTC-TTS-F (lower latency via embedding stacking along the feature dimension).
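To make the interleaving contrast concrete, here is a minimal sketch in plain Python. The exact grouping rule of the paper's 'bi-word' scheme is not spelled out in this summary, so the `bi_word_interleave` function below is an illustrative assumption: tokens are grouped two words at a time, with each pair followed by the speech tokens an aligner assigned to those words. All token values are hypothetical.

```python
def fixed_ratio_interleave(text, speech, n_text=1, n_speech=4):
    """Baseline strategy: emit n_text text tokens, then n_speech
    speech tokens, repeating until both streams are exhausted."""
    out, t, s = [], 0, 0
    while t < len(text) or s < len(speech):
        out += text[t:t + n_text]
        t += n_text
        out += speech[s:s + n_speech]
        s += n_speech
    return out

def bi_word_interleave(words, word_speech_tokens):
    """Sketch of a 'bi-word' strategy (assumed grouping): take words two
    at a time, then append the speech tokens aligned to that word pair."""
    out = []
    for i in range(0, len(words), 2):
        pair = words[i:i + 2]
        out += pair
        for w in pair:
            out += word_speech_tokens[w]
    return out

words = ["hello", "world", "again"]
# Hypothetical aligner output: speech tokens attributed to each word.
word_speech = {"hello": ["s1", "s2"], "world": ["s3"], "again": ["s4", "s5"]}
print(bi_word_interleave(words, word_speech))
# ['hello', 'world', 's1', 's2', 's3', 'again', 's4', 's5']
```

The key difference is that the fixed-ratio baseline slices both streams blindly at a constant rate, while the bi-word grouping keeps each word's text and speech tokens adjacent, which is what allows the model to exploit alignment regularities.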
For context, most LLM-based TTS systems prioritize output quality over latency, making them poorly suited to real-time applications. High-quality streaming requires precise text-speech alignment and a training regimen that balances speed and fidelity, a challenge prior methods have struggled with. In the reported experiments, CTC-TTS outperforms both fixed-ratio interleaving and MFA-based baselines on key metrics for streaming synthesis and zero-shot voice tasks.
The practical implication is a significant step toward LLM-powered voices that are both natural-sounding and responsive enough for live interaction. This could enable more fluid conversational AI assistants, real-time audiobook generation, and accessible tech that speaks without perceptible lag, closing the gap between research-grade TTS and production-ready systems.
- Replaces MFA alignment with a neural CTC aligner, reducing pipeline complexity and increasing flexibility.
- Introduces 'bi-word' interleaving strategy, outperforming fixed-ratio methods for capturing text-speech alignment.
- Offers two variants: CTC-TTS-L for higher quality and CTC-TTS-F for 50% lower latency in streaming synthesis.
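The shape arithmetic behind the two variants can be sketched with NumPy. All dimensions, and the idea of upsampling text embeddings to the speech frame rate before feature stacking, are illustrative assumptions rather than the paper's actual configuration:

```python
import numpy as np

# Illustrative (assumed) dimensions: 6 text tokens, 24 speech frames, width 512.
T_text, T_speech, d = 6, 24, 512
text_emb = np.random.randn(T_text, d)
speech_emb = np.random.randn(T_speech, d)

# CTC-TTS-L: concatenate along the sequence length.
# Result: a longer sequence of the same embedding width.
seq_concat = np.concatenate([text_emb, speech_emb], axis=0)
assert seq_concat.shape == (T_text + T_speech, d)

# CTC-TTS-F: stack along the feature dimension.
# Result: the speech-frame sequence length, but double the width.
# This assumes text embeddings are first upsampled to the speech frame
# rate (a uniform 1:4 repeat here, purely for illustration).
text_upsampled = np.repeat(text_emb, T_speech // T_text, axis=0)
feat_stack = np.concatenate([text_upsampled, speech_emb], axis=1)
assert feat_stack.shape == (T_speech, 2 * d)
```

The trade-off follows from the shapes: the L variant pays for quality with a longer sequence (more decoding steps), while the F variant keeps the sequence short, which is consistent with the lower-latency claim for CTC-TTS-F.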
Why It Matters
Enables real-time, natural AI voices for live captions, conversational assistants, and interactive audiobooks without lag.