Audio & Speech

T5Gemma-TTS uses a 4B encoder-decoder model for superior multilingual voice cloning

New TTS model beats XTTSv2 on Japanese speaker similarity and generalizes to Korean, a language absent from its training data.

Deep Dive

Researchers Chihiro Arata and Kiyoshi Kurihara have introduced T5Gemma-TTS, a text-to-speech model that tackles a key weakness in current voice cloning systems. Decoder-only autoregressive models are powerful, but they force the input text to compete with the generated audio for positional context, which weakens text conditioning over long sentences. T5Gemma-TTS addresses this with a 4-billion-parameter encoder-decoder backbone (2B encoder + 2B decoder) based on the T5Gemma model. The architecture keeps text conditioning persistent by routing bidirectional text representations through cross-attention at every decoder layer, and it processes text directly at the subword level, with no phoneme conversion step.
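To make the idea of persistent text conditioning concrete, here is a minimal single-head cross-attention sketch in plain Python. It is an illustration only, not the model's implementation: the function names, the toy vectors, and the use of identity projections (keys and values are the raw encoder states) are all assumptions made for brevity. What it shows is that the encoder output is computed once and stays fully visible to every decoder step, however much audio has already been generated.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, enc_states):
    """One decoder query attends over all encoder (text) states.

    Assumption: identity projections, so keys == values == enc_states.
    Returns the attention weights and the resulting context vector.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale
              for key in enc_states]
    weights = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, enc_states))
               for d in range(dim)]
    return weights, context

# Hypothetical encoder output: 3 text tokens, 2 dimensions each.
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two hypothetical decoder queries, standing in for different audio steps.
# Both see the same, complete set of text states.
for step, query in enumerate([[2.0, 0.0], [0.0, 2.0]]):
    weights, context = cross_attention(query, enc_states)
    print(step, [round(w, 3) for w in weights])
```

In a decoder-only model, by contrast, the text tokens would sit in the same sequence as the audio tokens and share one positional budget with them.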

A key innovation is the Progress-Monitoring Rotary Position Embedding (PM-RoPE), injected into all 26 cross-attention layers. It gives the decoder a normalized progress signal for tracking the target speech length, which improves duration control. Trained on a 170,000-hour multilingual dataset (English, Chinese, Japanese), the model achieves a statistically significant speaker-similarity gain over XTTSv2 on Japanese (0.677 vs. 0.622). It also posts the numerically highest speaker similarity on Korean (0.747), a language not included in its training data, though this margin over XTTSv2 is not statistically conclusive. The model's reliance on PM-RoPE is stark: disabling it at inference causes near-total synthesis failure, with character error rate (CER) degrading from 0.129 to 0.982.
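For readers unfamiliar with the metric, the sketch below computes character error rate under its usual definition: the character-level edit distance between a reference text and a hypothesis, divided by the reference length. In TTS evaluation the hypothesis is typically a speech recognizer's transcript of the synthesized audio. The function names and sample strings are illustrative, and the article does not say which recognizer or text normalization the authors used, so treat this as a generic sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)

print(cer("hello", "hallo"))  # one substitution out of five characters
print(cer("abc", ""))         # every reference character is missing
```

On this scale, a CER of 0.982 means nearly every reference character is wrong or missing, which is why the ablation result reads as synthesis failure. CER can also exceed 1.0 when the hypothesis contains many inserted characters.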

Key Points
  • Uses a 4B-parameter T5Gemma encoder-decoder backbone for persistent text conditioning, avoiding the long-context weakness of decoder-only models.
  • Introduces PM-RoPE for duration control; disabling it at inference causes near-total synthesis failure (CER jumps from 0.129 to 0.982).
  • Trained on 170K hours of speech, it beats XTTSv2 on Japanese speaker similarity (0.677 vs. 0.622) and posts the highest similarity on unseen Korean (0.747), though the Korean margin is not statistically conclusive.

Why It Matters

Enables more accurate, long-form voice cloning for professional dubbing, audiobooks, and assistive tech, with robust multilingual support.
