Mistral Introduces "Voxtral TTS": An Open-Weight Text-to-Speech Model That Clones a Voice From 3 Seconds of Audio, Supports 9 Languages, and Beats ElevenLabs Flash v2.5 With a 68.4% Human Preference Win Rate
Open-weight voice cloning model captures human nuances like 'ums' and 'ahs' from just 3 seconds of audio.
Mistral AI has launched Voxtral TTS, an open-weight text-to-speech model that makes high-fidelity voice cloning widely available. The 4-billion-parameter model can replicate a voice from just three seconds of reference audio, capturing not only tone but also human nuances such as accent, inflection, intonation, and vocal fillers (the 'ums' and 'ahs'). In human preference evaluations, Voxtral achieved a 68.4% win rate against the proprietary ElevenLabs Flash v2.5 model across all nine supported languages, while matching the more advanced ElevenLabs v3 on emotional expressiveness and quality.
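To make the headline figure concrete, here is a minimal sketch of how a pairwise human preference win rate is typically computed. The article does not state the sample size or how ties were handled, so the vote tally below is hypothetical, chosen only to reproduce a 68.4% result. The function name and the "A"/"B"/"tie" labels are illustrative assumptions, not Mistral's evaluation code.

```python
from collections import Counter


def preference_win_rate(votes, model="A", ties_count_half=False):
    """Return the share of pairwise comparisons in which `model` was preferred.

    `votes` is an iterable of labels: "A", "B", or "tie" (hypothetical scheme).
    """
    counts = Counter(votes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes to score")
    wins = counts[model]
    if ties_count_half:
        # One common convention: a tie counts as half a win for each side.
        wins += 0.5 * counts["tie"]
    return wins / total


# Hypothetical tally, NOT data from the article: 1,000 listeners each hear
# the same sentence from model A and model B and pick the one they prefer.
votes = ["A"] * 684 + ["B"] * 316
print(f"Win rate for A: {preference_win_rate(votes):.1%}")
```

A win rate above 50% means listeners preferred that model more often than not; how far above 50% counts as meaningful depends on the number of comparisons, which the article does not report.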
Voxtral is designed for accessibility: it needs a compact 3 GB of RAM, enabling it to run locally on smartphones, laptops, and edge devices. It supports nine languages (English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic) and features cross-lingual cloning, meaning a French voice prompt can be used to generate English speech. Model latency is just 70 ms, matching ElevenLabs Flash v2.5's time-to-first-audio but at higher quality. With the weights freely available on Hugging Face, Mistral is directly challenging the API lock-in and proprietary moat that companies like ElevenLabs have built.
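A quick back-of-the-envelope calculation shows what a 3 GB budget implies for a 4-billion-parameter model. This is a sketch under stated assumptions: the article does not say what numeric precision the weights use or whether the 3 GB figure also covers activations and audio buffers, so the code treats the whole budget as weight storage and uses decimal gigabytes.

```python
def weight_footprint_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


def implied_bits_per_param(num_params, budget_gb):
    """Average bits per weight if the entire budget held only weights."""
    return budget_gb * 1e9 * 8 / num_params


PARAMS = 4_000_000_000  # "4-billion parameter" from the article
BUDGET_GB = 3           # "3GB RAM requirement" from the article

# Weight storage at some common precisions (assumed, not stated in the article).
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_footprint_gb(PARAMS, bits):.1f} GB")

print(f"Implied average: {implied_bits_per_param(PARAMS, BUDGET_GB):.1f} bits per parameter")
```

Under these assumptions, 16-bit weights alone would take about 8 GB, so a 3 GB footprint works out to roughly 6 bits per parameter on average, which suggests the figure refers to a compressed or quantized build rather than full-precision weights.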
This release represents a significant shift in the AI voice synthesis landscape, from closed, subscription-based services toward open, locally deployable technology. For developers and businesses, it means direct control over voice applications without the recurring fees or data privacy concerns associated with cloud APIs.
- Clones any voice from just 3 seconds of audio with zero fine-tuning, capturing human vocal nuances
- Achieves a 68.4% human preference win rate against ElevenLabs Flash v2.5 across 9 languages and matches ElevenLabs v3 on emotional expressiveness and quality
- Runs locally in 3 GB of RAM with 70 ms latency, enabling smartphone and edge-device deployment
Why It Matters
Democratizes professional-grade voice cloning, breaking proprietary API lock-in and enabling private, cost-effective deployment for developers.