Audio & Speech

Sarashina2.2-TTS cracks Japanese kanji polyphony with 361k hours of data

arXiv eess.AS June 26, 2026

⚡A new TTS model reads all 2,136 Joyo kanji with state-of-the-art accuracy.

Deep Dive

Researchers from multiple institutions have unveiled Sarashina2.2-TTS, a Japanese-focused text-to-speech system built on large language models. The paper, posted on arXiv, addresses a longstanding challenge in Japanese speech synthesis: kanji polyphony, where a single kanji character has multiple possible readings depending on context. The team trained the model on approximately 361,000 hours of speech data, combining Japanese and English to improve cross-lingual robustness.

To tackle polyphony directly, the team designed a data augmentation pipeline covering all 2,136 Joyo kanji (regular-use characters designated by Japan's Agency for Cultural Affairs). They also introduce the Joyo Kanji Yomi Benchmark, which includes 4,378 readings, and a new metric called Kana-CER, which measures pronunciation correctness by comparing synthesized speech with reference readings in kana space, so that differences in orthography do not count as errors. According to the paper, Sarashina2.2-TTS achieves state-of-the-art kanji-level reading accuracy, matches top baselines on general sentence pronunciation, and delivers the highest speaker similarity in zero-shot Japanese TTS. Notably, it is reported to be the only system evaluated that keeps Japanese pronunciation stable regardless of prompt language, a result that supports the case for balanced bilingual training.
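To make the idea of scoring in kana space concrete, here is a minimal Python sketch of a kana-level character error rate. It is an illustration under stated assumptions, not the paper's implementation: it assumes both the reference reading and the transcription of the synthesized speech are already available as kana strings, and the function names (to_katakana, edit_distance, kana_cer) are hypothetical. How the paper obtains and normalizes those kana strings is not described in this summary.

```python
def to_katakana(text: str) -> str:
    """Map hiragana (U+3041..U+3096) onto katakana so both scripts compare equal."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def kana_cer(ref_kana: str, hyp_kana: str) -> float:
    """Hypothetical kana-level CER: edit distance divided by reference length.

    Assumption: both inputs are kana readings; converting kanji text or audio
    to kana is outside this sketch.
    """
    ref = to_katakana(ref_kana)
    hyp = to_katakana(hyp_kana)
    if not ref:
        raise ValueError("reference reading must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# The same reading written in hiragana and katakana counts as correct,
# while a wrong reading of the same word is penalized.
same_reading = kana_cer("きょう", "キョウ")
wrong_reading = kana_cer("きょう", "こんにち")
```

Because both strings are normalized to a single script before comparison, a reading written in hiragana and the same reading written in katakana score 0.0, and only genuine pronunciation differences contribute to the error.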

Key Points

Trained on 361k hours of speech with balanced Japanese and English data for cross-lingual robustness.
Targeted data augmentation covers all 2,136 Joyo kanji, yielding state-of-the-art reading accuracy on the new Joyo Kanji Yomi Benchmark.
Introduces the Kana-CER metric, which measures pronunciation correctness directly by eliminating orthographic variation.

Why It Matters

Sarashina2.2-TTS addresses a critical bottleneck in Japanese TTS, the misreading of kanji with multiple readings, and brings accurate, natural-sounding speech for kanji-based text within closer reach.
