Open Source

Releasing MioTTS: A family of lightweight, fast LLM-based TTS models (0.1B - 2.6B) with Zero-shot Voice Cloning

A developer has released a family of lightweight TTS models that clone a voice from a short reference audio clip, with no per-speaker fine-tuning.

Deep Dive

A developer has open-sourced MioTTS, a family of lightweight LLM-based text-to-speech models ranging from 0.1B to 2.6B parameters. The key feature is zero-shot voice cloning from short reference audio, with high fidelity reported even at the smallest 0.1B scale. The models are bilingual (English/Japanese), trained on ~100k hours of speech, and use a custom neural audio codec (MioCodec) for fast generation. The reported Real-Time Factor (RTF) is as low as 0.04, meaning one second of audio takes roughly 0.04 seconds to synthesize.
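To put the 0.04 figure in context: Real-Time Factor is conventionally defined as synthesis time divided by the duration of the audio produced, so lower is faster and anything below 1.0 is faster than real time. The short Python sketch below only illustrates that arithmetic. The function names and the 0.4 s / 10 s timings are hypothetical values chosen to reproduce an RTF of 0.04; they are not measurements from MioTTS.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical timings (not MioTTS measurements): 10 s of speech in 0.4 s.
rtf = real_time_factor(synthesis_seconds=0.4, audio_seconds=10.0)
print(f"RTF: {rtf:.2f}")
print(f"Speed-up over real time: {speedup_over_real_time(rtf):.0f}x")
```

At an RTF of 0.04, generation runs about 25 times faster than playback. The figure depends on the hardware and model size used for the measurement, which the summary does not specify.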

Why It Matters

Models this small bring high-quality zero-shot voice cloning within reach of consumer hardware, which could put pressure on existing voice synthesis and content creation tools.