LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space
A new diffusion model bypasses traditional spectrograms, achieving a 0.818 speaker similarity score for zero-shot cloning.
Meituan's LongCat-TTS research team has introduced a significant architectural shift in text-to-speech synthesis. Instead of relying on intermediate acoustic representations like mel-spectrograms—a standard choice that can introduce compounding errors—LongCat-TTS operates directly within a learned waveform latent space. This streamlined approach requires only two core components: a waveform variational autoencoder (Wav-VAE) to compress audio and a diffusion model backbone to generate it. The team also introduced key inference improvements, fixing a training-inference mismatch and replacing classifier-free guidance with a new adaptive projection guidance technique to boost quality.
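The paper's exact adaptive projection guidance formulation is not given here, so the sketch below is a hypothetical illustration: it contrasts standard classifier-free guidance with a projection-style variant in which the guidance direction is split into components parallel and orthogonal to the conditional prediction and weighted separately. Function names, the weighting scheme, and the choice of projection axis are all assumptions, not the authors' method.

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    # Standard classifier-free guidance: extrapolate along the
    # conditional-minus-unconditional direction with scale w.
    return eps_uncond + w * (eps_cond - eps_uncond)

def projection_guidance(eps_cond, eps_uncond, w_parallel, w_orthogonal):
    # Hypothetical projection-based guidance sketch (NOT the paper's
    # exact method): decompose the guidance direction into components
    # parallel and orthogonal to the conditional prediction, then
    # apply separate weights to each component.
    d = eps_cond - eps_uncond
    flat_c = eps_cond.ravel()
    denom = np.dot(flat_c, flat_c) + 1e-8  # avoid division by zero
    parallel = (np.dot(d.ravel(), flat_c) / denom) * eps_cond
    orthogonal = d - parallel
    return eps_cond + w_parallel * parallel + w_orthogonal * orthogonal
```

Separating the two components lets the orthogonal part (often associated with quality and detail) be amplified without the oversaturation that a large uniform CFG scale can cause along the parallel direction.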
Despite forgoing complex multi-stage training pipelines and high-quality annotated datasets, the model achieves state-of-the-art performance. The largest variant, LongCat-TTS-3.5B, outperforms the previous SOTA model (Seed-TTS) on zero-shot voice cloning benchmarks, improving speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH and from 0.776 to 0.797 on the more challenging Seed-Hard set. In a counterintuitive finding from their ablation studies, the researchers discovered that superior reconstruction fidelity in the Wav-VAE does not necessarily translate to better overall TTS performance, highlighting the complex interplay between model components. All model weights (including 1B, 3.5B, and quantized versions) and code have been open-sourced to accelerate community research.
- Operates directly in waveform latent space, bypassing error-prone mel-spectrograms to simplify the TTS pipeline.
- The 3.5B parameter model sets a new SOTA for zero-shot voice cloning, boosting speaker similarity scores on key benchmarks.
- Open-sources all code and model weights (1B & 3.5B params) on Hugging Face, with ComfyUI integration for easy use.
Why It Matters
Delivers more accurate and natural-sounding voice cloning with a simpler, more efficient architecture, advancing open-source AI speech tools.