Audio & Speech

Simultaneous Speech-to-Speech Translation Without Aligned Data

This new model breaks a major bottleneck for real-time, multilingual translation.

Deep Dive

Researchers introduced Hibiki-Zero, a model for simultaneous speech-to-speech translation that eliminates the need for hard-to-collect, word-level aligned training data. Using a novel reinforcement learning strategy, it achieves state-of-the-art accuracy, latency, and voice quality across five language tasks. Crucially, it can be adapted to support a new input language with fewer than 1,000 hours of speech, fundamentally simplifying the scaling of real-time translation to diverse languages.

Why It Matters

It removes a major data bottleneck, enabling faster, cheaper development of high-quality, real-time translation for far more languages.