MOSS-TTS v1.5 adds 11 languages and explicit pause control
Now supports 31 languages with language tags and inline pause markers like '[pause 3.2s]'
OpenMOSS has launched MOSS-TTS-v1.5, an upgraded version of its open-source text-to-speech model. The update focuses on three core improvements: expanded multilingual support, more reliable voice cloning, and fine-grained prosody control. The model now supports 31 languages, adding Cantonese, Dutch, Finnish, Hindi, Macedonian, Malay, Romanian, Swahili, Tagalog, Thai, and Vietnamese to the original 20. Results depend on whether the language field is set: when it is omitted, performance varies by language, but when the input is explicitly tagged (e.g., language="French"), v1.5 outperforms v1.0 on nearly all languages.
Voice cloning in v1.5 is more stable: speaker similarity is improved and repeated generations show less variance. The model also copes better than before when a long reference audio clip is paired with a short target text. A standout addition is explicit pause control: users can insert inline markers like "[pause 3.2s]" to force timed silences (e.g., before a dramatic word). Punctuation-driven prosody is also more consistent, especially in long sentences. Alongside the TTS update, the team released MOSS-SoundEffect-v2.0 for sound effects generation.
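The inline pause markers can be assembled programmatically before text is handed to the model. The following is a minimal sketch, not part of MOSS-TTS itself: the helper names (`pause_marker`, `with_pause`, `total_pause_seconds`) are hypothetical, and the marker grammar is inferred only from the "[pause 3.2s]" example above, so the exact syntax the model accepts may differ.

```python
import re

# Marker shape inferred from the article's "[pause 3.2s]" example.
# ASSUMPTION: the real grammar accepted by MOSS-TTS-v1.5 may differ.
PAUSE_RE = re.compile(r"\[pause (\d+(?:\.\d+)?)s\]")


def pause_marker(seconds: float) -> str:
    """Format a pause marker such as "[pause 3.2s]" (hypothetical helper)."""
    if seconds <= 0:
        raise ValueError("pause duration must be positive")
    # ":g" drops trailing zeros, so 2.0 becomes "2" and 3.2 stays "3.2".
    return f"[pause {seconds:g}s]"


def with_pause(before: str, seconds: float, after: str) -> str:
    """Join two text fragments with a timed pause between them."""
    return f"{before.rstrip()} {pause_marker(seconds)} {after.lstrip()}"


def total_pause_seconds(text: str) -> float:
    """Sum the durations of all pause markers found in the text."""
    return sum(float(value) for value in PAUSE_RE.findall(text))


script = with_pause("And the winner is", 3.2, "Alice.")
print(script)
print(total_pause_seconds(script))
```

Building markers through one helper keeps the formatting consistent, and summing the durations gives a quick estimate of how much silence a script adds to the generated audio.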
- Supports 31 languages, 11 of them new, including Cantonese, Hindi, Thai, and Swahili
- Explicit pause control with inline markers like "[pause 3.2s]" for timed breaks
- More stable voice cloning with reduced variance and improved speaker similarity
Why It Matters
Creators and developers now get multilingual, controllable TTS with precise timing for professional audio production.