KaniTTS2: a text-to-speech model with frame-level position encodings, optimized for real-time conversational AI.
This open-source TTS model was pretrained in just 6 hours on 8x H100 GPUs and supports voice cloning.
Deep Dive
KaniTTS2 is a new 400M-parameter text-to-speech model optimized for real-time conversational AI. It achieves a Real-Time Factor (RTF) of 0.2 on an RTX 5080 using just 3 GB of VRAM, meaning it generates audio about five times faster than the audio takes to play back, which is fast enough for live applications. The model supports voice cloning and was pretrained on 10k hours of speech data in only 6 hours using 8x H100 GPUs. It supports English, Spanish, and Kyrgyz, and the full pretraining code is released under the Apache 2.0 license.
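To make the headline numbers concrete, here is a minimal sketch of the arithmetic behind them. It assumes the usual definition of Real-Time Factor (synthesis time divided by the duration of the generated audio). The function name `real_time_factor` and the 2-second/10-second timing are hypothetical illustrations, not measurements or code from the KaniTTS2 release.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio.

    Values below 1.0 mean the model produces audio faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical timing chosen to match the reported RTF of 0.2:
# 10 seconds of speech synthesized in 2 seconds.
rtf = real_time_factor(synthesis_seconds=2.0, audio_seconds=10.0)

# How many times faster than playback the model runs at this RTF.
speedup = 1.0 / rtf

# Pretraining compute from the reported setup: 8 GPUs for 6 hours.
gpu_hours = 8 * 6

print(f"RTF: {rtf:.2f}, speedup: {speedup:.1f}x, pretraining: {gpu_hours} GPU-hours")
```

At an RTF of 0.2 the model runs about five times faster than playback, and the reported pretraining run works out to 48 H100 GPU-hours.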
Why It Matters
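With a short pretraining run, modest inference hardware requirements, and openly licensed pretraining code, KaniTTS2 lowers the barrier to building custom, real-time voice AI for new languages and accents.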