WavTTS nearly matches latent TTS quality by modeling raw waveforms directly
First diffusion TTS model to work directly on raw audio, not compressed representations
WavTTS, developed by a team of researchers from multiple institutions, tackles a long-standing challenge in text-to-speech: generating high-quality speech directly as raw waveforms, without relying on compressed intermediate representations such as mel-spectrograms or VAE latents. Today's leading zero-shot TTS models run diffusion in these compressed spaces for efficiency, but compression discards information and the resulting pipelines are not trained end to end. WavTTS is the first to demonstrate that scaling diffusion-based TTS in the waveform space can approach the quality of latent-space models. It builds on flow matching with a Diffusion Transformer (DiT) and uses a simple patchification strategy to manage the extremely long sequences that raw audio produces. Multi-scale mel-spectrogram supervision provides perceptual guidance during training, while careful design of prediction targets and noise scheduling further improves generation quality.
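To make patchification and flow matching concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The patch size, the 24 kHz sample rate, and the helper names (`patchify`, `unpatchify`, `flow_matching_pair`) are all illustrative assumptions, and the linear noise-to-data path with a velocity target is the common textbook form of flow matching, which may differ from the prediction target and noise schedule WavTTS actually uses.

```python
import numpy as np

def patchify(wave, patch_size):
    """Group a 1-D waveform into tokens of `patch_size` samples each.

    The tail is zero-padded so the length divides evenly.
    Returns an array of shape (num_tokens, patch_size).
    """
    pad = (-len(wave)) % patch_size
    padded = np.pad(wave, (0, pad))
    return padded.reshape(-1, patch_size)

def unpatchify(patches, length):
    """Flatten tokens back into a waveform and drop the padding."""
    return patches.reshape(-1)[:length]

def flow_matching_pair(x1, t, rng):
    """Build one training pair on a linear path from noise x0 to data x1.

    x_t = (1 - t) * x0 + t * x1, with velocity target v = x1 - x0.
    (Assumed textbook formulation, not confirmed for WavTTS.)
    """
    x0 = rng.standard_normal(x1.shape)
    xt = (1.0 - t) * x0 + t * x1
    v = x1 - x0
    return xt, v

# Hypothetical example: one second of audio at an assumed 24 kHz rate.
rng = np.random.default_rng(0)
wave = rng.uniform(-1.0, 1.0, size=24000)
tokens = patchify(wave, patch_size=240)       # 24000 samples -> 100 tokens
xt, v = flow_matching_pair(tokens, t=0.3, rng=rng)
print(tokens.shape, xt.shape, v.shape)
```

With these assumed numbers, the transformer would see 100 tokens instead of 24,000 individual samples, which is the point of patchifying before attention.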
Evaluated on open-source benchmarks, WavTTS closely matches the performance of current state-of-the-art latent generative zero-shot TTS models and significantly outperforms previous end-to-end waveform generation models. The work shows that direct waveform modeling for TTS is not only feasible but can achieve competitive results, paving the way for truly end-to-end speech generation with fewer artifacts and higher fidelity. The findings suggest that future TTS systems may move away from compressed representations entirely, simplifying pipelines and improving audio quality.
- First raw waveform diffusion TTS model using flow matching with DiT and patchification to handle long audio sequences
- Integrates multi-scale mel-spectrogram supervision to guide training without losing its end-to-end nature
- Benchmarks show it closely matches SOTA latent generative models and outperforms previous end-to-end waveform models
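The multi-scale mel-spectrogram supervision mentioned above can be sketched as an auxiliary loss that compares log-mel spectrograms of the generated and reference waveforms at several resolutions. The version below is a hedged NumPy illustration only: the 16 kHz sample rate, the three (FFT size, hop, mel bins) settings, the L1 distance, and all function names are assumptions, not values taken from WavTTS.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def stft_magnitude(x, n_fft, hop):
    """Magnitude STFT with a Hann window; requires len(x) >= n_fft."""
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)
    return np.abs(np.fft.rfft(frames, axis=-1))

def multi_scale_mel_loss(pred, target, sr=16000,
                         scales=((256, 64, 32), (512, 128, 64), (1024, 256, 80))):
    """Mean L1 distance between log-mel spectrograms, averaged over scales.

    Each scale is an assumed (n_fft, hop, n_mels) triple.
    """
    total = 0.0
    for n_fft, hop, n_mels in scales:
        fb = mel_filterbank(n_mels, n_fft, sr)
        mel_pred = np.log(stft_magnitude(pred, n_fft, hop) @ fb.T + 1e-5)
        mel_tgt = np.log(stft_magnitude(target, n_fft, hop) @ fb.T + 1e-5)
        total += np.mean(np.abs(mel_pred - mel_tgt))
    return total / len(scales)

# Hypothetical check: a clean tone versus a noisy copy of it.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220.0 * np.arange(16000) / 16000)
noisy = clean + 0.1 * rng.standard_normal(16000)
print(multi_scale_mel_loss(clean, clean), multi_scale_mel_loss(noisy, clean))
```

Because the spectrograms are computed from the waveform itself, a loss like this can add perceptual guidance without inserting a separately trained stage, which is consistent with the end-to-end claim in the summary.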
Why It Matters
Enables higher fidelity, truly end-to-end TTS without information loss from compressed representations.