Mistral AI is releasing Voxtral TTS, a 3-billion-parameter text-to-speech model with open weights that the company says outperformed ElevenLabs Flash v2.5 in human preference tests. The model runs on about 3 GB of RAM, achieves a 90-millisecond time-to-first-audio, and supports nine languages.
A free, 3-billion-parameter model that runs on about 3 GB of RAM, supports nine languages, and, Mistral says, outperforms ElevenLabs Flash v2.5.
Mistral AI has entered the text-to-speech market with Voxtral TTS, an open-weights model that directly challenges established players. The company claims its 3-billion-parameter model outperformed ElevenLabs' popular Flash v2.5 in human preference evaluations. The release marks a significant move by Mistral beyond its core large language models (LLMs) into the competitive generative audio space, giving developers and researchers a freely accessible alternative.
Technically, Voxtral TTS is designed for efficiency and speed. It requires only about 3 GB of RAM to run, which makes it deployable on modest hardware, and Mistral reports a time-to-first-audio of 90 milliseconds, meaning the delay between submitting text and receiving the first piece of synthesized audio. The model supports speech synthesis in nine languages, broadening its potential use in global products. By releasing the model weights for free, Mistral is continuing its established open-weights approach, which could accelerate innovation and lower barriers to entry for high-quality TTS technology.
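For readers who want to check a latency figure like this on their own hardware, time-to-first-audio can be measured for any streaming synthesizer by timing how long the first non-empty audio chunk takes to arrive. The following is a minimal sketch under stated assumptions, not Mistral's benchmark method: `time_to_first_audio` and `fake_streaming_tts` are hypothetical names, and the fake engine merely simulates a 90 ms startup delay rather than calling Voxtral TTS.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_audio(
    synthesize: Callable[[str], Iterable[bytes]], text: str
) -> Tuple[float, bytes]:
    """Return (seconds until the first non-empty audio chunk, that chunk)."""
    start = time.perf_counter()
    for chunk in synthesize(text):
        if chunk:  # ignore empty chunks; only real audio counts
            return time.perf_counter() - start, chunk
    raise RuntimeError("stream ended before any audio was produced")


def fake_streaming_tts(text: str, startup_delay_s: float = 0.09) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS engine (NOT the Voxtral API)."""
    time.sleep(startup_delay_s)  # simulated delay before the first chunk
    for _ in text.split():
        yield b"\x00" * 320  # placeholder bytes, one chunk per word


if __name__ == "__main__":
    ttfa, first = time_to_first_audio(fake_streaming_tts, "hello from a fake engine")
    print(f"time-to-first-audio: {ttfa * 1000:.0f} ms, first chunk: {len(first)} bytes")
```

To measure a real engine, replace `fake_streaming_tts` with any callable that takes text and yields audio chunks. Results will vary with hardware, input length, and warm-up state, so averaging over several runs gives a fairer comparison.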
The launch signifies a strategic expansion for Mistral AI and intensifies competition in the voice AI market. For developers, it provides a compelling, cost-effective alternative to proprietary APIs, enabling more control over deployment and data privacy. The model's performance claims, if validated by the community, could pressure commercial providers to improve their offerings or adjust pricing, ultimately giving users more choice and potentially higher-quality, more affordable speech synthesis tools.
- Voxtral TTS is a 3-billion-parameter open-weights model that Mistral claims beat ElevenLabs Flash v2.5 in human preference tests.
- It runs on about 3 GB of RAM with a 90 ms time-to-first-audio and supports nine languages.
- Mistral is releasing the model weights for free, challenging paid TTS services with an open-weights alternative.
Why It Matters
Voxtral TTS offers a free alternative to paid TTS APIs that Mistral claims is competitive on quality, giving developers more control and potentially lowering costs for voice-enabled applications.