Audio & Speech

GibbsTTS uses kinetic-optimal scheduling for superior zero-shot speech synthesis

The new method posts the best objective naturalness among evaluated baselines and ranks first in speaker similarity on three of four test sets.

Deep Dive

A team of researchers from the University of Tokyo (Dong Yang, Yiyi Cai, Haoyu Zhang, Yuki Saito, Hiroshi Saruwatari) has introduced GibbsTTS, a zero-shot text-to-speech (TTS) system that addresses two fundamental limitations of metric-induced discrete flow matching (MI-DFM). MI-DFM exploits token-latent geometry for discrete generation, but it has two weaknesses: its heuristic schedulers require hyperparameter tuning, and its first-order continuous-time Markov chain (CTMC) solver introduces path-tracking errors when run with a finite number of steps. GibbsTTS introduces a kinetic-optimal scheduler that traverses probability paths at constant Fisher-Rao speed without additional training, and a finite-step moment correction that adjusts jump probabilities while preserving the jump destination distribution.
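
To give a feel for what "constant Fisher-Rao speed" means, here is a minimal Python sketch. It is not the paper's scheduler: it is a generic arc-length reparameterization applied to a made-up probability path. The function names, the toy path, and the grid sizes are all assumptions chosen for illustration. The idea is to measure how far the distribution moves (in Fisher-Rao distance) along a fine time grid, then pick step times that split that total distance into equal pieces.

```python
import numpy as np


def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two categorical distributions."""
    bc = np.clip(np.sum(np.sqrt(p * q)), 0.0, 1.0)  # Bhattacharyya coefficient
    return 2.0 * np.arccos(bc)


def arc_length(path_fn, t_start, t_end, pieces=200):
    """Approximate Fisher-Rao length of the path between two times."""
    ts = np.linspace(t_start, t_end, pieces + 1)
    return sum(fisher_rao_distance(path_fn(a), path_fn(b))
               for a, b in zip(ts[:-1], ts[1:]))


def constant_speed_times(path_fn, num_steps, grid_size=1001):
    """Choose num_steps + 1 times in [0, 1] that cut the path into
    pieces of (approximately) equal Fisher-Rao length."""
    grid = np.linspace(0.0, 1.0, grid_size)
    hops = [fisher_rao_distance(path_fn(a), path_fn(b))
            for a, b in zip(grid[:-1], grid[1:])]
    cumulative = np.concatenate([[0.0], np.cumsum(hops)])
    targets = np.linspace(0.0, cumulative[-1], num_steps + 1)
    # Invert the cumulative-length curve: length -> time.
    return np.interp(targets, cumulative, grid)


# Hypothetical toy path (an assumption, not the MI-DFM path): a linear
# mixture from a uniform distribution to a peaked one over four tokens.
P_START = np.full(4, 0.25)
P_END = np.array([0.85, 0.05, 0.05, 0.05])


def toy_path(t):
    return (1.0 - t) * P_START + t * P_END


times = constant_speed_times(toy_path, num_steps=8)
lengths = [arc_length(toy_path, a, b) for a, b in zip(times[:-1], times[1:])]
print("step times:", np.round(times, 3))
print("Fisher-Rao length per step:", np.round(lengths, 4))
```

On this toy path, evenly spaced times would give steps of unequal Fisher-Rao length, while the reparameterized times give steps of nearly equal length. No training or tuned hyperparameter is involved; the schedule is computed from the path itself.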

In extensive experiments using a unified architecture and a large-scale dataset, GibbsTTS outperformed all evaluated baselines. It achieved the best objective naturalness scores and was preferred in subjective listening tests over masked discrete generative models. When compared with state-of-the-art TTS systems, GibbsTTS demonstrated strong speaker similarity, ranking first on three out of four test sets and second on the remaining one. The paper is currently under review and has been posted on arXiv (2605.09386). A project page with code and demos is available.

Key Points
  • GibbsTTS uses a training-free kinetic-optimal scheduler that maintains constant Fisher-Rao speed, eliminating hyperparameter search.
  • A finite-step moment correction adjusts jump probabilities while preserving the CTMC jump destination distribution.
  • Achieves the best objective naturalness among evaluated baselines and ranks first in speaker similarity on 3 of 4 test sets against state-of-the-art TTS systems.
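
The second point separates two questions in a CTMC sampling step: how likely a token is to jump, and where it lands if it does. The Python sketch below illustrates that separation under stated assumptions. It is not the paper's moment correction. The "euler" mode is the standard first-order approximation (step size times total rate), and the "exponential" mode is the textbook exact jump probability for a rate held constant over the step, included only as a familiar example of changing the jump probability while leaving the destination distribution untouched. All function names and the example rates are hypothetical.

```python
import numpy as np


def jump_probability(total_rate, h, mode="euler"):
    """Probability that a token leaves its current state within a step of size h."""
    if mode == "euler":
        # First-order approximation; clipped because h * rate can exceed 1.
        return min(1.0, h * total_rate)
    if mode == "exponential":
        # Exact for a rate that stays constant over the step.
        return 1.0 - np.exp(-h * total_rate)
    raise ValueError(f"unknown mode: {mode}")


def destination_distribution(rates):
    """Where the token lands, given that it jumps. Independent of the mode above."""
    return rates / rates.sum()


def ctmc_step(state, rates, h, rng, mode="euler"):
    """One sampling step for a single token.

    rates[j] is the outgoing rate from `state` to state j (rates[state] == 0).
    Returns the new state, which equals `state` if no jump occurs.
    """
    total = rates.sum()
    if total <= 0.0:
        return state
    if rng.random() < jump_probability(total, h, mode):
        return int(rng.choice(len(rates), p=destination_distribution(rates)))
    return state


# Hypothetical rates out of state 0 in a four-token vocabulary.
example_rates = np.array([0.0, 1.2, 0.6, 0.2])
rng = np.random.default_rng(0)
new_state = ctmc_step(0, example_rates, h=0.1, rng=rng, mode="euler")
print("new state:", new_state)
print("euler jump prob:", jump_probability(example_rates.sum(), 0.1, "euler"))
print("exponential jump prob:", jump_probability(example_rates.sum(), 0.1, "exponential"))
print("destination distribution:", destination_distribution(example_rates))
```

The two modes disagree on the jump probability, and the gap grows with the step size, but both sample the landing state from the same normalized rates.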

Why It Matters

GibbsTTS delivers speaker-adaptive, natural-sounding speech from text, advancing zero-shot TTS for real-world voice cloning and synthesis.