Audio & Speech

WavTTS matches latent TTS quality by modeling raw waveforms directly

First diffusion TTS model shown to approach latent-model quality while working directly on raw audio, not compressed representations

Deep Dive

WavTTS, developed by researchers from multiple institutions, tackles a long-standing challenge in text-to-speech: generating high-quality speech directly as raw waveforms, without relying on compressed intermediate representations such as mel-spectrograms or VAE latents. Today's leading zero-shot TTS models run diffusion in these compressed spaces for efficiency, but compression loses information and the resulting pipelines are not trained end to end. WavTTS is the first to demonstrate that scaling diffusion-based TTS in waveform space can approach the quality of latent-space models. It builds on flow matching with a Diffusion Transformer (DiT) and uses a simple patchification strategy to keep the extremely long sequences of raw audio manageable. Multi-scale mel-spectrogram supervision provides perceptual guidance during training, while careful design of prediction targets and noise scheduling further improves generation quality.
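
To make patchification and flow matching concrete, here is a minimal sketch in plain Python. It is an illustration, not the WavTTS implementation: the function names, the 24 kHz sample rate, and the patch size of 240 are all hypothetical, and the linear noise-to-data path with a velocity target is the common flow matching formulation, assumed here because the summary does not specify the exact prediction target or noise schedule.

```python
import random


def patchify(wave, patch_size):
    """Group a 1-D waveform into non-overlapping patches, zero-padding the tail."""
    pad = (-len(wave)) % patch_size
    padded = list(wave) + [0.0] * pad
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def unpatchify(patches, length):
    """Flatten patches back into a waveform and drop any padding."""
    flat = [sample for patch in patches for sample in patch]
    return flat[:length]


def flow_matching_pair(x1, t, rng):
    """Build one training pair on an assumed linear path from noise to data.

    x_t = (1 - t) * x0 + t * x1, with target velocity v = x1 - x0,
    where x0 is Gaussian noise and x1 is the clean signal.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return x_t, v


# One second of audio at a hypothetical 24 kHz, with a hypothetical patch size.
wave = [0.0] * 24000
tokens = patchify(wave, 240)
print(len(tokens))  # 100 tokens instead of 24000 samples
```

The point of patchifying is sequence length: the transformer attends over patches rather than individual samples, so the token count shrinks by a factor equal to the patch size.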

Evaluated on open-source benchmarks, WavTTS closely matches current state-of-the-art latent generative zero-shot TTS models and significantly outperforms previous end-to-end waveform generation models. The work shows that direct waveform modeling for TTS is not only feasible but competitive, paving the way for truly end-to-end speech generation with fewer artifacts and higher fidelity. The findings suggest that future TTS systems may be able to drop compressed representations entirely, simplifying pipelines and improving audio quality.

Key Points
  • First raw-waveform diffusion TTS model shown to approach latent-model quality, using flow matching with DiT and patchification to handle long audio sequences
  • Adds multi-scale mel-spectrogram supervision to guide training while keeping the model end-to-end
  • On open-source benchmarks, it closely matches SOTA latent generative models and significantly outperforms previous end-to-end waveform models
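
The multi-scale mel-spectrogram supervision mentioned above can be sketched as a loss that compares predicted and reference audio at several time-frequency resolutions. The version below is an assumption-laden illustration in plain Python: the L1 distance on log-mel values, the two scales, the 16 kHz sample rate, and every function name are hypothetical choices, and the naive DFT is only practical for tiny inputs. The summary does not describe the loss WavTTS actually uses.

```python
import cmath
import math


def magnitude_frames(wave, n_fft, hop):
    """Magnitude spectra of Hann-windowed frames via a naive DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(wave) - n_fft + 1, hop):
        seg = [wave[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(n_fft)))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def log_mel(wave, n_fft, hop, n_mels, sample_rate):
    """Log-mel frames at one resolution; a small floor avoids log(0)."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    return [[math.log(sum(w * x for w, x in zip(row, mag)) + 1e-5) for row in bank]
            for mag in magnitude_frames(wave, n_fft, hop)]


def multi_scale_mel_loss(pred, target, sample_rate=16000,
                         scales=((64, 16, 8), (128, 32, 16))):
    """Mean L1 distance between log-mels, averaged over (n_fft, hop, n_mels) scales."""
    total = 0.0
    for n_fft, hop, n_mels in scales:
        a = log_mel(pred, n_fft, hop, n_mels, sample_rate)
        b = log_mel(target, n_fft, hop, n_mels, sample_rate)
        diffs = [abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb)]
        total += sum(diffs) / len(diffs)
    return total / len(scales)
```

Because the loss is computed from the generated waveform itself, it can sit alongside the generative objective without inserting a mel-spectrogram stage into the pipeline, which is how such supervision stays compatible with end-to-end training.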

Why It Matters

Points toward higher-fidelity, truly end-to-end TTS that avoids the information loss of compressed representations.