WavTTS matches latent TTS quality by modeling raw waveforms directly

arXiv eess.AS June 03, 2026

⚡First diffusion TTS model to work directly on raw audio, not compressed representations

Deep Dive

WavTTS, developed by a team of researchers from multiple institutions, tackles a long-standing challenge in text-to-speech: generating high-quality speech directly as raw waveforms, without relying on compressed intermediate representations such as mel-spectrograms or VAE latents. Most current zero-shot TTS models run diffusion in these compressed spaces for efficiency, but they suffer from information loss and are not trained end to end. WavTTS is the first to demonstrate that scaling diffusion-based TTS in the waveform space can approach the quality of latent-space models. It builds on flow matching with a Diffusion Transformer (DiT) and uses a simple patchification strategy to manage the extremely long sequences that raw audio produces. In addition, multi-scale mel-spectrogram supervision provides perceptual guidance during training, while careful design of the prediction targets and noise schedule further improves generation quality.
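To make the sequence-length problem concrete, here is a minimal sketch of patchification in plain Python. It assumes the simplest possible scheme, non-overlapping patches with zero padding at the tail, which may differ from what WavTTS actually does. The function names, the 24 kHz sample rate, and the patch size of 256 are all hypothetical values chosen for illustration; none of them come from the paper.

```python
import math


def patchify(waveform, patch_size):
    """Split a 1-D waveform into non-overlapping patches, zero-padding the tail."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    num_patches = math.ceil(len(waveform) / patch_size)
    padding = [0.0] * (num_patches * patch_size - len(waveform))
    padded = list(waveform) + padding
    return [padded[i * patch_size:(i + 1) * patch_size] for i in range(num_patches)]


def unpatchify(patches, length):
    """Flatten patches back into a waveform and drop the padding."""
    flat = [sample for patch in patches for sample in patch]
    return flat[:length]


# Assumed values for illustration only (not taken from the paper).
SAMPLE_RATE = 24_000   # samples per second
PATCH_SIZE = 256       # samples per transformer token

waveform = [0.0] * (10 * SAMPLE_RATE)   # 10 seconds of silence
tokens = patchify(waveform, PATCH_SIZE)
print(len(waveform), "samples ->", len(tokens), "tokens")
# prints: 240000 samples -> 938 tokens
```

Under these assumed numbers, a 10-second clip shrinks from 240,000 positions to 938 tokens, which is the kind of reduction that makes transformer attention over raw audio tractable. In practice this would be a tensor reshape inside the model rather than list slicing.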

On open-source benchmarks, WavTTS closely matches the performance of current state-of-the-art latent generative zero-shot TTS models and significantly outperforms previous end-to-end waveform generation models. The work shows that direct waveform modeling for TTS is not only feasible but can achieve competitive results, paving the way for truly end-to-end speech generation with fewer artifacts and higher fidelity. The findings suggest that future TTS systems may move away from compressed representations entirely, simplifying pipelines and improving audio quality.

Key Points

First raw waveform diffusion TTS model using flow matching with DiT and patchification to handle long audio sequences
Integrates multi-scale mel-spectrogram supervision to guide training without losing end-to-end nature
Benchmarks show it matches SOTA latent generative models and outperforms previous end-to-end waveform models
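
The flow matching objective named in the first key point can be illustrated with the widely used linear-path formulation: a training input is interpolated between Gaussian noise and the clean signal, and the network learns to predict the velocity that carries one to the other. The sketch below is plain Python and is not the WavTTS implementation. The authors report tuning their prediction targets and noise schedule, so their choices may differ from this one. The names `flow_matching_pair`, `euler_sample`, and `oracle` are hypothetical, and `oracle` stands in for a trained network.

```python
import random


def flow_matching_pair(clean, t, rng):
    """Return (x_t, velocity target) on the straight path from noise to data.

    x_t = (1 - t) * noise + t * clean, and the regression target is clean - noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    x_t = [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]
    velocity = [c - n for n, c in zip(noise, clean)]
    return x_t, velocity


def euler_sample(velocity_fn, start, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(start)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
clean = [0.5, -0.25, 0.0, 1.0]   # stand-in for a few waveform samples

# Training side: a noisy input and the velocity the model should predict.
x_t, target = flow_matching_pair(clean, t=0.3, rng=rng)


# Sampling side: an oracle that already knows the clean signal replaces
# the trained network, so the integration can be checked end to end.
def oracle(x, t):
    return [(c - xi) / (1.0 - t) for c, xi in zip(clean, x)]


start = [rng.gauss(0.0, 1.0) for _ in clean]
recovered = euler_sample(oracle, start, steps=8)
```

With the oracle velocity, `recovered` lands on `clean` up to floating-point error. A real model only approximates this velocity from text and speaker conditioning, which is where the training objective and schedule matter.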

Why It Matters

Enables higher-fidelity, truly end-to-end TTS without the information loss introduced by compressed representations.
