LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space
Researchers' new diffusion model skips mel-spectrograms entirely, achieving state-of-the-art voice cloning with a simplified two-component pipeline.
A research team led by Detai Xin has introduced LongCat-AudioDiT, a text-to-speech model that discards the industry-standard mel-spectrogram pipeline. Instead, it operates directly in a learned waveform latent space, using just two components: a waveform variational autoencoder (Wav-VAE) and a diffusion transformer backbone. This architectural shift aims to mitigate the compounding errors inherent in multi-stage TTS systems and substantially simplifies both training and inference. The team also introduced two key inference improvements: rectifying a training-inference mismatch and replacing classifier-free guidance with a novel adaptive projection guidance method to boost quality.
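To make the two-stage structure concrete, here is a minimal, illustrative sketch of the inference flow. All names, shapes, hop size, latent dimension, and step count are placeholders for this article, not values from the released implementation.

```python
import numpy as np

class WavVAE:
    """Placeholder waveform VAE: compresses raw audio into a short sequence of
    latent vectors and decodes latents back to audio. The hop size and latent
    dimension here are illustrative, not the paper's actual configuration."""

    def __init__(self, hop: int = 320, latent_dim: int = 64):
        self.hop = hop
        self.latent_dim = latent_dim

    def encode(self, wav: np.ndarray) -> np.ndarray:
        # [num_samples] waveform -> [num_frames, latent_dim] latents (stand-in math)
        num_frames = len(wav) // self.hop
        return np.zeros((num_frames, self.latent_dim))

    def decode(self, latents: np.ndarray) -> np.ndarray:
        # [num_frames, latent_dim] latents -> [num_frames * hop] waveform
        return np.zeros(latents.shape[0] * self.hop)


def synthesize(text_embedding, prompt_wav, vae, denoise_step, num_steps=32):
    """Sketch of two-component inference: encode the voice prompt with the
    Wav-VAE, iteratively denoise waveform latents with a diffusion transformer
    conditioned on the text and prompt, then decode the latents straight to
    audio -- no mel-spectrogram stage and no separate vocoder."""
    prompt_latents = vae.encode(prompt_wav)
    latents = np.random.randn(prompt_latents.shape[0], vae.latent_dim)  # start from noise
    for t in reversed(range(num_steps)):
        # denoise_step stands in for one call to the diffusion transformer backbone
        latents = denoise_step(latents, t, text_embedding, prompt_latents)
    return vae.decode(latents)
```

The point of the sketch is only the shape of the pipeline: one latent codec and one generative backbone, with nothing between the denoised latents and the output waveform.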
Experimental results are compelling. The largest model variant, LongCat-AudioDiT-3.5B, achieved state-of-the-art zero-shot voice cloning performance on the challenging Seed benchmark. It improved speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH and from 0.776 to 0.797 on Seed-Hard, surpassing the previous leader, Seed-TTS. Notably, the researchers made a counterintuitive discovery: better Wav-VAE reconstruction fidelity does not necessarily yield better final TTS quality, highlighting complex trade-offs in latent-space design. The team has released the code and model weights publicly, inviting further development in the speech AI community.
- Operates directly in the waveform latent space, eliminating the need for error-prone mel-spectrogram intermediates.
- The 3.5B-parameter model scores 0.818 SIM on Seed-ZH, beating the previous SOTA model, Seed-TTS.
- Uses a simplified two-component pipeline (Wav-VAE + diffusion transformer) and introduces adaptive projection guidance for inference (sketched below).
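The paper's exact adaptive projection guidance rule is not reproduced here. As a rough sketch under that caveat, the snippet below contrasts standard classifier-free guidance with a generic projection-based variant: the guidance direction is split into components parallel and orthogonal to the conditional prediction, and the parallel part is down-weighted, which is the general idea behind projected-guidance methods. Function names and the `eta` parameter are assumptions for illustration.

```python
import numpy as np

def classifier_free_guidance(cond: np.ndarray, uncond: np.ndarray, scale: float) -> np.ndarray:
    """Standard CFG: extrapolate along the (cond - uncond) direction."""
    return uncond + scale * (cond - uncond)

def projected_guidance(cond: np.ndarray, uncond: np.ndarray,
                       scale: float, eta: float = 0.0) -> np.ndarray:
    """Generic projection-based guidance (illustrative, not the paper's exact rule).

    The guidance direction is decomposed into a component parallel to the
    conditional prediction and an orthogonal remainder; the parallel part,
    which tends to cause artifacts at high guidance scales, is scaled by eta.
    """
    diff = cond - uncond
    cond_flat, diff_flat = cond.ravel(), diff.ravel()
    parallel = (np.dot(diff_flat, cond_flat) / (np.dot(cond_flat, cond_flat) + 1e-12)) * cond
    orthogonal = diff - parallel
    return cond + (scale - 1.0) * (eta * parallel + orthogonal)

# Tiny usage example with random stand-ins for the model's two predictions.
cond_pred, uncond_pred = np.random.randn(8), np.random.randn(8)
print(classifier_free_guidance(cond_pred, uncond_pred, scale=3.0))
print(projected_guidance(cond_pred, uncond_pred, scale=3.0, eta=0.0))
```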
Why It Matters
Simplifies high-fidelity voice cloning pipelines, enabling more realistic and efficient synthetic speech for content creation and assistive tech.