KaniTTS2 — open-source 400M TTS model with voice cloning, runs in 3GB VRAM. Pretrain code included.
This 400M-parameter model could democratize voice AI for any language.
Deep Dive
The open-source KaniTTS2 text-to-speech model has been released, featuring real-time voice cloning and multilingual support. The 400M-parameter model runs in just 3GB of VRAM and achieves a real-time factor of 0.2 on high-end GPUs, meaning it synthesizes audio roughly five times faster than playback. Training took just 6 hours on 8x H100 GPUs over 10k hours of speech data. Critically, the team is releasing the complete pretraining code, allowing anyone to train custom TTS models for specific languages or accents.
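To make the real-time-factor claim concrete, here is a small back-of-the-envelope sketch. The constants come from the figures reported above; the clip length is a hypothetical example, not something from the release.

```python
# Numbers reported for KaniTTS2; clip length is a hypothetical example.
RTF = 0.2            # real-time factor: synthesis time / audio duration
VRAM_GB = 3          # reported inference memory footprint
AUDIO_SECONDS = 60   # hypothetical clip length for illustration

# An RTF below 1.0 means synthesis is faster than playback.
synthesis_seconds = AUDIO_SECONDS * RTF  # time to generate the clip
speedup = 1 / RTF                        # how many times faster than real time

print(f"Generating {AUDIO_SECONDS}s of audio takes ~{synthesis_seconds:.0f}s "
      f"({speedup:.0f}x faster than playback) in {VRAM_GB}GB of VRAM")
```

At an RTF of 0.2, a one-minute clip takes about 12 seconds to synthesize, which is what makes streaming and interactive use cases feasible on a single consumer GPU.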
Why It Matters
It enables developers and communities to create high-quality, localized voice AI without massive computational resources or proprietary platforms.