PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
New AI system from Korean researchers achieves more accurate lip-sync than human voice actors in cross-language dubbing.
A research team from Gwangju Institute of Science and Technology has developed PS-TTS, a novel text-to-speech system specifically designed for automated dubbing that synchronizes translated speech with on-screen lip movements. The system tackles two critical synchronization challenges: isochrony (matching speech duration) and phonetic synchronization (matching lip movements). Their approach uses a language model to paraphrase the translated text for timing alignment, then employs dynamic time warping with vowel distance metrics so that target-language vowels produce mouth shapes similar to those of the source-language vowels.
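The alignment step can be sketched as classic dynamic time warping over two vowel sequences. The articulatory features and the Euclidean distance below are illustrative assumptions for the "vowel distance metric" the article mentions, not the paper's actual formulation:

```python
# Hypothetical sketch of DTW with a vowel distance metric.
# The feature table (openness, frontness, roundedness) is an
# illustrative assumption, not the system's actual metric.

VOWEL_FEATURES = {
    "a": (1.0, 0.5, 0.0),
    "e": (0.5, 1.0, 0.0),
    "i": (0.0, 1.0, 0.0),
    "o": (0.5, 0.0, 1.0),
    "u": (0.0, 0.0, 1.0),
}

def vowel_distance(v1, v2):
    """Euclidean distance between toy articulatory feature vectors."""
    f1, f2 = VOWEL_FEATURES[v1], VOWEL_FEATURES[v2]
    return sum((a - b) ** 2 for a, b in zip(f1, f2)) ** 0.5

def dtw_vowel_cost(source, target):
    """Minimum cumulative cost of aligning target-language vowels
    to source-language vowels via dynamic time warping."""
    n, m = len(source), len(target)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = vowel_distance(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a source vowel
                                 cost[i][j - 1],      # skip a target vowel
                                 cost[i - 1][j - 1])  # match the pair
    return cost[n][m]

# Lower cost = a paraphrase whose vowels better track the
# on-screen mouth shapes.
print(dtw_vowel_cost(list("aio"), list("aeo")))  # → 0.5
```

Under this framing, the language model's paraphrase candidates can be rescored by DTW cost, preferring the wording whose vowel sequence most closely mirrors the original actor's mouth movements.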
The researchers extended this to PS-Comet, which jointly optimizes for both phonetic similarity and semantic preservation using a multi-objective approach. In comprehensive evaluations across Korean, English, and French language pairs, both systems outperformed standard TTS without phonetic synchronization on objective metrics. Remarkably, they also surpassed professional voice actors in lip-sync accuracy on Korean-to-English and English-to-Korean dubbing tasks. PS-Comet performed best across all language pairs, balancing lip synchronization with meaning preservation more effectively than the phonetic-only approach.
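The multi-objective trade-off can be illustrated as a weighted ranking of candidate paraphrases. The scoring functions, weights, and candidate scores below are invented for illustration; the actual system pairs phonetic similarity with a learned semantic-quality metric (COMET-style), not hand-set numbers:

```python
# Illustrative sketch of multi-objective candidate ranking.
# alpha and the candidate scores are assumptions for demonstration.

def combined_score(phonetic_sim, semantic_sim, alpha=0.5):
    """Weighted trade-off between lip-sync and meaning preservation.

    phonetic_sim, semantic_sim: similarity scores in [0, 1].
    alpha: assumed hyperparameter weighting phonetic synchronization.
    """
    return alpha * phonetic_sim + (1 - alpha) * semantic_sim

def pick_best_paraphrase(candidates):
    """candidates: list of (text, phonetic_sim, semantic_sim) tuples."""
    return max(candidates, key=lambda c: combined_score(c[1], c[2]))

candidates = [
    ("literal translation", 0.40, 0.95),   # faithful, but poor lip-sync
    ("loose paraphrase",    0.90, 0.55),   # great lip-sync, meaning drifts
    ("balanced paraphrase", 0.75, 0.85),   # the compromise
]
print(pick_best_paraphrase(candidates)[0])  # → "balanced paraphrase"
```

This captures why a joint objective beats phonetic-only scoring: a candidate that maximizes lip-sync alone can lose the translation's meaning, while the combined score rewards candidates that do well on both axes.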
The technology represents a significant advancement in automated dubbing, addressing what has been a persistent challenge in cross-language media localization. By achieving better lip synchronization than human professionals while maintaining semantic accuracy, PS-TTS could dramatically reduce the cost and time required for high-quality dubbing of films, television shows, and educational content. The system's demonstrated effectiveness across multiple language pairs suggests broad applicability for global media distribution.
- PS-TTS uses phonetic synchronization with dynamic time warping to match vowel mouth shapes across languages
- PS-Comet variant jointly optimizes for both lip-sync accuracy and semantic preservation using multi-objective training
- Both systems outperformed professional voice actors on lip-sync accuracy in Korean-English dubbing and were evaluated across Korean, English, and French language pairs
Why It Matters
Enables cheaper, faster, and more natural automated dubbing for global media while maintaining lip synchronization better than human professionals.