Fish Audio Releases S2: open-source, controllable and expressive TTS model
Open-source S2 model generates expressive, multi-speaker dialogue with a 100 ms time-to-first-audio and supports 80+ languages.
Fish Audio has launched S2, a fully open-source text-to-speech (TTS) model that challenges the dominance of proprietary services from tech giants. The model's standout feature is precise control over vocal expression through plain natural language. Users direct a voice's delivery by embedding simple emotion tags, such as [whispers sweetly] or [laughing nervously], directly in the text prompt, which allows nuanced, context-aware speech generation without complex technical parameters.
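To make the inline-tag idea concrete, here is a minimal Python sketch of how a tagged prompt could be split into (tag, text) segments. It only illustrates the prompt format described above; the function name `split_tagged_prompt` and the parsing rules are assumptions for illustration, not Fish Audio's code or the S2 API, and S2 may interpret tags quite differently.

```python
import re

# Matches one inline bracketed tag such as "[whispers sweetly]".
# Assumption: tags are non-nested and delimited by square brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tagged_prompt(prompt):
    """Split a prompt into (tag, text) pairs.

    Each run of text is paired with the most recent tag that precedes it;
    text that appears before any tag is paired with None.
    Hypothetical helper, not part of any S2 tooling.
    """
    segments = []
    current_tag = None
    position = 0
    for match in TAG_PATTERN.finditer(prompt):
        # Text between the previous tag and this one belongs to the previous tag.
        text = prompt[position:match.start()].strip()
        if text:
            segments.append((current_tag, text))
        current_tag = match.group(1).strip()
        position = match.end()
    # Remaining text after the last tag.
    tail = prompt[position:].strip()
    if tail:
        segments.append((current_tag, tail))
    return segments


if __name__ == "__main__":
    prompt = "[whispers sweetly] I saved you a seat. [laughing nervously] I hope that's okay."
    for tag, text in split_tagged_prompt(prompt):
        print(f"{tag!r}: {text}")
```

In this sketch a tag with no text after it is silently dropped, which is a simplification; a real model consumes the tags as part of the prompt rather than through a separate parsing step like this one.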
Technically, S2 is engineered for speed and versatility, with a time-to-first-audio of 100 milliseconds. It can generate multi-speaker dialogue in a single inference pass, which removes the need to synthesize each speaker separately and stitch the results together. The model supports more than 80 languages. According to Fish Audio, S2 outperformed every closed-source model, including those from Google and OpenAI, on the Audio Turing Test and EmergentTTS-Eval, benchmarks that measure how natural and human-like synthetic speech sounds.
The release of S2 on Hugging Face gives developers, researchers, and creators a high-quality, controllable, and fast TTS model under an open-source license, so they can build applications without the vendor lock-in or usage costs of APIs from major corporations. This could accelerate work on voice assistants, audiobooks, game dialogue, and other media that require expressive, multilingual speech synthesis.
- Uses natural-language emotion tags (e.g., [whispers sweetly]) for precise, intuitive control over vocal expression.
- Generates multi-speaker dialogue in one pass, with a 100 ms time-to-first-audio, and supports 80+ languages.
- Reportedly beats closed-source models from Google and OpenAI on the Audio Turing Test and EmergentTTS-Eval benchmarks.
Why It Matters
S2 provides a free, high-quality alternative to proprietary TTS APIs, enabling more expressive and customizable voice applications without vendor lock-in.