Beyond Two-stage Diffusion TTS: Joint Structure and Content Refinement via Jump Diffusion
New AI speech model beats Grad-TTS on word error rate and adds natural pauses automatically.
A team of researchers has introduced a novel framework for text-to-speech (TTS) synthesis that tackles a core challenge in AI voice generation: current diffusion and flow-matching models struggle to balance discrete temporal structure (such as syllable timing) against continuous spectral modeling (sound quality). Two-stage models often produce flat, monotone speech, while single-stage models suffer from unstable word alignment. The proposed 'Jump Diffusion' framework addresses this within a single unified model: discrete jumps handle the structural timing of speech, while continuous diffusion refines the acoustic content.
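The paper's exact parameterization isn't given here, but the general idea can be sketched as a toy sampler. In the minimal Python sketch below, every name, rate, and the zero-target denoising step is an illustrative placeholder, not the authors' method: each reverse step applies a continuous denoising update to a mel-like array, and occasionally fires a discrete jump that inserts or deletes a frame, changing the sequence's structure mid-generation.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_jump_diffusion_sampler(n_frames=80, n_mels=16, n_steps=50, jump_rate=0.05):
    """Toy sketch of joint structure/content refinement.

    All quantities are illustrative stand-ins; a real model would predict
    the denoising direction and the jump events with learned networks.
    """
    x = rng.standard_normal((n_frames, n_mels))  # start from pure noise
    for step in range(n_steps):
        t = 1.0 - step / n_steps  # reverse time runs from 1 down to 0

        # Continuous part: one denoising step toward a placeholder target
        # (a trained score/flow network would supply this direction).
        target = np.zeros_like(x)
        x = x + (target - x) / (n_steps - step)

        # Discrete part: occasionally jump the structure by inserting or
        # deleting a frame (e.g. lengthening a pause). Jumps taper off
        # as t approaches 0, so late steps only refine content.
        if rng.random() < jump_rate * t:
            pos = int(rng.integers(0, x.shape[0]))
            if rng.random() < 0.5 and x.shape[0] > 1:
                x = np.delete(x, pos, axis=0)          # remove a frame
            else:
                x = np.insert(x, pos, x[pos], axis=0)  # duplicate a frame

    return x

mel = toy_jump_diffusion_sampler()
print(mel.shape)  # final frame count differs from 80 whenever jumps fired
```

The point of the sketch is the interleaving: structure (frame count) and content (frame values) are refined in the same loop, rather than fixing durations in a first stage and filling in audio in a second.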
The results are strong both quantitatively and qualitatively. In its one-shot form, the model achieved a word error rate (WER) of 3.37% on the standard LJSpeech benchmark, a clear improvement over the 4.38% WER of the established Grad-TTS model, alongside better audio quality scores (UTMOSv2). More notably, the full iterative variant (UDD) enables adaptive prosody generation: the model can modify speech rhythm in out-of-distribution scenarios, for example inserting natural-sounding pauses into artificially slow speech rather than unnaturally stretching every sound. The result is more human-like, expressive audio.
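To make that contrast concrete, here is a hypothetical comparison, assuming pause boundaries are given up front; in the actual system such points would be discovered by the learned jump process, not passed in. Uniform stretching repeats every frame, while pause insertion keeps each sound at its natural length and spends the slack as silence.

```python
import numpy as np

def uniform_stretch(frames: np.ndarray, factor: int) -> np.ndarray:
    """Naive slow-down: repeat every frame, stretching all sounds equally."""
    return np.repeat(frames, factor, axis=0)

def insert_pauses(frames: np.ndarray, factor: int, boundaries: list[int]) -> np.ndarray:
    """Adaptive slow-down (illustrative only): keep sounds at natural
    length and place the extra time as silence at the given pause points.
    `boundaries` is a hypothetical input assumed known here."""
    extra = (factor - 1) * len(frames)            # total slack frames to add
    per_pause = extra // max(len(boundaries), 1)  # slack spent per pause
    silence = np.zeros((per_pause, frames.shape[1]))
    pieces, prev = [], 0
    for b in boundaries:
        pieces += [frames[prev:b], silence]
        prev = b
    pieces.append(frames[prev:])
    return np.vstack(pieces)

mel = np.random.default_rng(0).standard_normal((100, 16))  # fake mel frames
print(uniform_stretch(mel, 2).shape)          # (200, 16): every sound doubled
print(insert_pauses(mel, 2, [30, 70]).shape)  # (200, 16): pauses absorb slack
```

Both outputs are the same length, but only the second preserves the natural duration of the speech sounds themselves, which is the behavior the adaptive variant is reported to produce.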
- Unified 'Jump Diffusion' framework combines timing (structure) and sound (content) refinement in one process.
- Achieves a 3.37% Word Error Rate on LJSpeech, outperforming Grad-TTS (4.38%) with better audio quality.
- Enables adaptive prosody, allowing the model to insert natural pauses in slow speech instead of uniformly stretching every sound.
Why It Matters
This research could lead to more expressive, natural-sounding AI voices for assistants, audiobooks, and content creation.