LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space
Researchers' new diffusion model skips mel-spectrograms entirely, achieving state-of-the-art voice cloning with a simplified two-component pipeline.
A research team led by Detai Xin has introduced LongCat-AudioDiT, a text-to-speech model that discards the industry-standard mel-spectrogram pipeline. Instead, it operates directly in a learned waveform latent space, using just two components: a waveform variational autoencoder (Wav-VAE) and a diffusion transformer backbone. This architectural shift aims to mitigate the compounding errors inherent in multi-stage TTS systems and substantially simplifies both training and inference. The team also introduced two key inference improvements: rectifying a training-inference mismatch and replacing classifier-free guidance with a novel adaptive projection guidance method to boost quality.
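To make the two-stage structure concrete, here is a minimal, illustrative sketch of the inference flow. All names, shapes, hop size, latent dimension, and step count are placeholders for this article, not values from the released implementation.

```python
import numpy as np

class WavVAE:
    """Placeholder waveform VAE: compresses raw audio into a short sequence of
    latent vectors and decodes latents back to audio. The hop size and latent
    dimension here are illustrative, not the paper's actual configuration."""

    def __init__(self, hop: int = 320, latent_dim: int = 64):
        self.hop = hop
        self.latent_dim = latent_dim

    def encode(self, wav: np.ndarray) -> np.ndarray:
        # [num_samples] waveform -> [num_frames, latent_dim] latents (stand-in math)
        num_frames = len(wav) // self.hop
        return np.zeros((num_frames, self.latent_dim))

    def decode(self, latents: np.ndarray) -> np.ndarray:
        # [num_frames, latent_dim] latents -> [num_frames * hop] waveform
        return np.zeros(latents.shape[0] * self.hop)


def synthesize(text_embedding, prompt_wav, vae, denoise_step, num_steps=32):
    """Sketch of two-component inference: encode the voice prompt with the
    Wav-VAE, iteratively denoise waveform latents with a diffusion transformer
    conditioned on the text and prompt, then decode the latents straight to
    audio -- no mel-spectrogram stage and no separate vocoder."""
    prompt_latents = vae.encode(prompt_wav)
    latents = np.random.randn(prompt_latents.shape[0], vae.latent_dim)  # start from noise
    for t in reversed(range(num_steps)):
        # denoise_step stands in for one call to the diffusion transformer backbone
        latents = denoise_step(latents, t, text_embedding, prompt_latents)
    return vae.decode(latents)
```

The point of the sketch is only the shape of the pipeline: one latent codec and one generative backbone, with nothing between the denoised latents and the output waveform.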
Experimental results are compelling. The largest model variant, LongCat-AudioDiT-3.5B, achieved state-of-the-art zero-shot voice cloning performance on the challenging Seed benchmark. It improved speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH and from 0.776 to 0.797 on Seed-Hard, surpassing the previous leader, Seed-TTS. Notably, the researchers made a counterintuitive discovery: better Wav-VAE reconstruction fidelity does not necessarily yield better final TTS quality, highlighting complex trade-offs in latent-space design. The team has released the code and model weights publicly, inviting further development in the speech AI community.
- Operates directly in the waveform latent space, eliminating the need for error-prone mel-spectrogram intermediates.
- The 3.5B-parameter model scores 0.818 SIM on Seed-ZH, beating the previous SOTA model, Seed-TTS.
- Uses a simplified two-component pipeline (Wav-VAE + diffusion transformer) and introduces adaptive projection guidance for inference (sketched below).
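The paper's exact adaptive projection guidance rule is not reproduced here. As a rough sketch under that caveat, the snippet below contrasts standard classifier-free guidance with a generic projection-based variant: the guidance direction is split into components parallel and orthogonal to the conditional prediction, and the parallel part is down-weighted, which is the general idea behind projected-guidance methods. Function names and the `eta` parameter are assumptions for illustration.

```python
import numpy as np

def classifier_free_guidance(cond: np.ndarray, uncond: np.ndarray, scale: float) -> np.ndarray:
    """Standard CFG: extrapolate along the (cond - uncond) direction."""
    return uncond + scale * (cond - uncond)

def projected_guidance(cond: np.ndarray, uncond: np.ndarray,
                       scale: float, eta: float = 0.0) -> np.ndarray:
    """Generic projection-based guidance (illustrative, not the paper's exact rule).

    The guidance direction is decomposed into a component parallel to the
    conditional prediction and an orthogonal remainder; the parallel part,
    which tends to cause artifacts at high guidance scales, is scaled by eta.
    """
    diff = cond - uncond
    cond_flat, diff_flat = cond.ravel(), diff.ravel()
    parallel = (np.dot(diff_flat, cond_flat) / (np.dot(cond_flat, cond_flat) + 1e-12)) * cond
    orthogonal = diff - parallel
    return cond + (scale - 1.0) * (eta * parallel + orthogonal)

# Tiny usage example with random stand-ins for the model's two predictions.
cond_pred, uncond_pred = np.random.randn(8), np.random.randn(8)
print(classifier_free_guidance(cond_pred, uncond_pred, scale=3.0))
print(projected_guidance(cond_pred, uncond_pred, scale=3.0, eta=0.0))
```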
Why It Matters
Simplifies high-fidelity voice cloning pipelines, enabling more realistic and efficient synthetic speech for content creation and assistive tech.