Mistral releases a new open-source model for speech generation
The new model can run on a smartwatch, costs a fraction of what competitors charge, and can adapt to a new voice from under five seconds of audio.
Mistral AI has entered the competitive speech generation market with Voxtral TTS, an open-source text-to-speech model designed for enterprise applications like voice assistants and customer support. The model's key selling points are its compact size, which allows it to run on edge devices from smartwatches to laptops, and its significantly lower cost compared to proprietary alternatives. It supports nine languages—English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic—and can adapt to a custom voice from less than five seconds of sample audio, capturing subtle accents and speech patterns.
Technically, Voxtral TTS is built for real-time performance, with a rapid time-to-first-audio (TTFA) of 90 milliseconds and a real-time factor (RTF) of 6x, meaning it can generate a 10-second audio clip in roughly 1.7 seconds. Based on the Ministral 3B architecture, it can also switch between languages mid-stream without losing voice characteristics, a feature useful for dubbing or real-time translation. Mistral's VP of Science Operations, Pierre Stock, emphasized that the goal was to create a model that sounds human, not robotic.
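The real-time-factor arithmetic above can be sketched as follows. This is a hypothetical helper for illustration only, not part of Mistral's API: an RTF of 6x means audio is synthesized six times faster than it plays back, so generation time is the clip duration divided by the RTF.

```python
def generation_time_s(audio_seconds: float, rtf: float) -> float:
    """Estimate wall-clock time to synthesize a clip of a given length.

    An RTF of 6x means the model produces audio six times faster
    than real time, so a 10-second clip takes about 10 / 6 ≈ 1.7 s.
    """
    return audio_seconds / rtf

# The 90 ms TTFA figure is a separate latency metric: it measures how
# quickly the *first* audio chunk arrives, which matters for streaming
# use cases like live customer-support agents.
```

With the figures in the article, `generation_time_s(10, 6)` works out to about 1.67 seconds, matching the claim for a 10-second clip.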
This release is part of Mistral's broader strategy to build a full suite of voice AI products, complementing its earlier transcription models. The company is positioning its open-source approach and customization capabilities as a major advantage for enterprises that want to fine-tune models for specific use cases. Ultimately, Mistral aims to create an end-to-end, multimodal platform that can handle audio, text, and image inputs and outputs within a single, agentic system.
- Compact & Cost-Effective: The model is small enough to run on edge devices like smartwatches and is priced at a fraction of competitors like ElevenLabs and OpenAI.
- Multilingual & Adaptive: Supports 9 languages and can clone a custom voice with less than 5 seconds of audio, capturing accents and intonations.
- Built for Real-Time: Features a 90ms time-to-first-audio and a 6x real-time factor, enabling fast generation for live applications like customer support agents.
Why It Matters
Lowers the barrier for enterprises to deploy customizable, multilingual voice AI at the edge, challenging closed-source providers on cost and flexibility.