Audio & Speech

RobustSpeechFlow boosts TTS accuracy by fixing alignment errors

New training technique cuts word error rates without external aligners or preference data

Deep Dive

Flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, but it remains vulnerable to skipped and repeated content because text-to-speech alignment is imperfect. Jinhyeok Yang et al. introduce RobustSpeechFlow, a training strategy that extends contrastive flow matching with length-preserving repeat and skip augmentations applied to the latents. Training against these augmented latents directly penalizes realistic failure modes without external aligners or preference data, so the method can be dropped into existing flow-matching TTS pipelines. The approach targets the content-fidelity errors that affect current state-of-the-art models, and the reported results come from a model with only 0.06 billion parameters.
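To make the idea concrete, here is a minimal sketch of what length-preserving repeat and skip augmentations and a contrastive flow-matching loss could look like. It is not the authors' implementation: the function names, the way the tail is trimmed or padded, the use of scalar "frames", and the weight lam=0.05 are all assumptions made for illustration. The loss follows the general contrastive flow-matching form, which pulls the predicted velocity toward the true target and pushes it away from a negative target, here assumed to come from a corrupted latent.

```python
# Illustrative sketch only: names, tail handling, and the lam value are
# assumptions, not details taken from the RobustSpeechFlow paper.

def repeat_augment(latent, start, span):
    """Duplicate latent[start:start+span], then trim the tail to keep the length."""
    if span <= 0 or start < 0 or start + span > len(latent):
        raise ValueError("span must lie inside the sequence")
    segment = latent[start:start + span]
    out = latent[:start + span] + segment + latent[start + span:]
    return out[:len(latent)]

def skip_augment(latent, start, span):
    """Drop latent[start:start+span], then pad with the last frame to keep the length."""
    if span <= 0 or span >= len(latent) or start < 0 or start + span > len(latent):
        raise ValueError("span must lie inside the sequence and leave frames behind")
    out = latent[:start] + latent[start + span:]
    return out + [out[-1]] * (len(latent) - len(out))

def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def contrastive_fm_loss(pred_velocity, target_velocity, negative_velocity, lam=0.05):
    """Match the true velocity target while moving away from the negative one."""
    return mse(pred_velocity, target_velocity) - lam * mse(pred_velocity, negative_velocity)

# Toy latent of 8 scalar "frames"; real latents would be vectors per frame.
frames = list(range(8))
repeated = repeat_augment(frames, start=2, span=2)  # [0, 1, 2, 3, 2, 3, 4, 5]
skipped = skip_augment(frames, start=2, span=2)     # [0, 1, 4, 5, 6, 7, 7, 7]
```

Both augmented sequences keep the original length of 8, so they can stand in for the clean latent in the same training batch without changing tensor shapes.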

On the Seed-TTS-eval benchmark, RobustSpeechFlow reduced Word Error Rate (WER) from 1.44 to 1.38. On the authors' new ZERO500 benchmark, which covers diverse speaker and prosody conditions, it delivered consistent improvements: at 24 function evaluations (NFE=24), English Character Error Rate (CER) dropped from 0.48% to 0.35%, and Korean CER fell from 0.81% to 0.57%. According to the authors, these results indicate stronger multilingual robustness without sacrificing speed or quality. The paper has been submitted to INTERSPEECH 2026, and audio samples are available online. The work suggests that carefully designed augmentations can reduce alignment errors without complex external modules, a practical win for production TTS systems.
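For readers unfamiliar with the metrics, the sketch below shows how WER and CER are commonly computed, as edit distance divided by reference length, and how the reported figures translate into relative reductions. The helper names are hypothetical, and published evaluations normally apply text normalization and an ASR model first, which this sketch omits.

```python
# Illustrative sketch: helper names are hypothetical, and no text
# normalization or ASR transcription step is included.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: character edits per reference character."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word error rate: word edits per reference word."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def relative_reduction(before, after):
    """Fraction of the original error that was removed."""
    return (before - after) / before

# A hypothesis with one repeated word ("sat") and one skipped word ("the").
example_wer = wer("the cat sat on the mat", "the cat sat sat on mat")

# Relative reductions implied by the figures reported in the article.
seed_tts_wer_gain = relative_reduction(1.44, 1.38)  # about 4%
english_cer_gain = relative_reduction(0.48, 0.35)   # about 27%
korean_cer_gain = relative_reduction(0.81, 0.57)    # about 30%
```

Seen this way, the Seed-TTS-eval gain is modest at roughly 4% relative, while the ZERO500 CER gains at NFE=24 are larger, at roughly 27% for English and 30% for Korean.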

Key Points
  • Reduces WER from 1.44 to 1.38 on Seed-TTS-eval with only 0.06B parameters
  • Cuts English CER from 0.48% to 0.35% and Korean CER from 0.81% to 0.57% on the ZERO500 benchmark at NFE=24
  • Requires no external aligners or preference data; integrates into existing flow-matching pipelines

Why It Matters

Zero-shot TTS becomes more robust for multilingual applications, with fewer skipped or repeated words and no added pipeline complexity.