Audio & Speech

T5Gemma-TTS Technical Report

New TTS model beats XTTSv2 in Japanese speaker similarity and generalizes zero-shot to unseen languages like Korean.

Deep Dive

Researchers Chihiro Arata and Kiyoshi Kurihara have introduced T5Gemma-TTS, a novel text-to-speech model that tackles a key weakness in current voice cloning systems. While autoregressive models are powerful, their decoder-only architecture forces input text to compete with the generated audio for positional context, weakening text conditioning over long sentences. T5Gemma-TTS solves this by using a 4-billion-parameter encoder-decoder backbone (2B encoder + 2B decoder) based on the T5Gemma model. This architecture allows persistent text conditioning by routing bidirectional text representations through cross-attention at every decoder layer, processing text directly at the subword level without needing phoneme conversion.
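The conditioning mechanism described above can be sketched as ordinary cross-attention: decoder (audio) states query the encoder's bidirectional text states at every layer, so the text signal never has to share the decoder's positional budget. This is a minimal illustration with made-up shapes and random projections standing in for learned weights, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_states, enc_states, d_head=16, seed=0):
    """Decoder queries attend over bidirectional text-encoder states.

    Hypothetical shapes: dec_states (T_dec, d), enc_states (T_txt, d).
    Random matrices stand in for the learned Q/K/V projections.
    """
    rng = np.random.default_rng(seed)
    d = dec_states.shape[-1]
    Wq = rng.normal(0, d ** -0.5, (d, d_head))
    Wk = rng.normal(0, d ** -0.5, (d, d_head))
    Wv = rng.normal(0, d ** -0.5, (d, d_head))
    q = dec_states @ Wq              # queries come from the audio decoder
    k = enc_states @ Wk              # keys/values come from the text encoder
    v = enc_states @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_head))  # (T_dec, T_txt)
    return attn @ v, attn

# Toy example: 5 decoder steps attending over 7 text subwords.
dec = np.random.default_rng(1).normal(size=(5, 32))
enc = np.random.default_rng(2).normal(size=(7, 32))
out, attn = cross_attention(dec, enc)
```

Because this path exists in every decoder layer, each generation step re-reads the full text representation, which is what keeps conditioning strong over long sentences.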

A key innovation is the Progress-Monitoring Rotary Position Embedding (PM-RoPE), injected into all 26 cross-attention layers. This gives the decoder a normalized progress signal to better track target speech length, dramatically improving duration control. Trained on a massive 170,000-hour multilingual dataset (English, Chinese, Japanese), the model achieves a statistically significant speaker-similarity gain over XTTSv2 on Japanese (0.677 vs. 0.622). Remarkably, it also shows the highest numerical similarity for Korean (0.747), a language not included in its training, though this margin over XTTSv2 is not statistically conclusive. The model's reliance on PM-RoPE is stark: disabling it at inference causes near-total synthesis failure, with character error rate degrading from 0.129 to 0.982.
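One plausible reading of PM-RoPE is that the cross-attention rotation is driven by normalized progress (current step divided by target length) rather than the absolute step index, so positions mean "how far through the utterance am I" regardless of duration. The sketch below assumes that form; the exact formulation, the `100.0` progress scale, and the function name are illustrative guesses, not the paper's specification.

```python
import numpy as np

def pm_rope(x, step, total_steps, base=10000.0):
    """Sketch of a progress-monitoring rotary embedding (assumed form).

    Rotates feature pairs of x by angles proportional to the normalized
    progress step/total_steps, instead of the absolute position used in
    standard RoPE. x: 1-D vector with an even number of dims.
    """
    d = x.shape[-1]
    assert d % 2 == 0
    progress = step / total_steps                    # in [0, 1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # standard RoPE frequencies
    angles = progress * 100.0 * inv_freq             # hypothetical progress scale
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                  # 2-D rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

x = np.random.default_rng(0).normal(size=(8,))
rotated = pm_rope(x, step=50, total_steps=100)
```

A useful property of this formulation: step 50 of a 100-step target and step 5 of a 10-step target receive the identical rotation, which is exactly the length-invariant progress signal the duration-control claim relies on.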

Key Points
  • Uses a 4B-parameter T5Gemma encoder-decoder backbone for persistent text conditioning, avoiding the long-context weakness of decoder-only models.
  • Introduces PM-RoPE for duration control; disabling it crashes performance (CER jumps from 0.129 to 0.982).
  • Trained on 170K hours of speech, it beats XTTSv2 on Japanese similarity and shows strong zero-shot ability on unseen Korean.

Why It Matters

Enables more accurate, long-form voice cloning for professional dubbing, audiobooks, and assistive tech, with robust multilingual support.