MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
New benchmark tests 10 languages, finds compositional controls remain a major bottleneck for current TTS systems.
A consortium of researchers has launched MINT-Bench, a groundbreaking multilingual benchmark designed to rigorously evaluate instruction-following text-to-speech (TTS) systems. Unlike previous benchmarks, MINT-Bench is built on a hierarchical taxonomy and a scalable data pipeline, enabling it to test models across ten languages with unprecedented diagnostic granularity. Its hybrid evaluation protocol jointly scores systems on three critical axes: content consistency (does the speech match the text?), instruction following (does it obey commands like "speak happily"?), and perceptual quality (does it sound natural?). The initial results provide a crucial snapshot of the current AI speech landscape.
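A minimal sketch of how such a hybrid protocol might aggregate the three axes. The dataclass, the 0–1 score scale, and the equal weights below are illustrative assumptions for exposition, not MINT-Bench's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class AxisScores:
    """Per-utterance scores on the three evaluation axes (0-1 scale assumed)."""
    content_consistency: float    # does the speech match the text?
    instruction_following: float  # does it obey commands like "speak happily"?
    perceptual_quality: float     # does it sound natural?

def composite_score(s: AxisScores, weights=(1/3, 1/3, 1/3)) -> float:
    """Weighted aggregate of the three axes.

    Equal weights are a placeholder; a real protocol would calibrate
    them (or report the axes separately, as leaderboards often do).
    """
    axes = (s.content_consistency, s.instruction_following, s.perceptual_quality)
    return sum(w * a for w, a in zip(weights, axes))
```

Reporting the axes separately, rather than only the composite, is what gives a benchmark like this its diagnostic value: a model can score well on naturalness while failing instruction following, and a single number would hide that.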
Experiments using MINT-Bench reveal a competitive and nuanced field. While frontier commercial systems from companies like Google and Amazon lead in overall performance, the benchmark shows that leading open-source models are closing the gap. Notably, in localized settings such as Chinese, some open-source models even outperform their commercial counterparts. However, the benchmark also exposes major weaknesses: current systems, commercial and open-source alike, struggle significantly with harder compositional instructions (e.g., "speak slowly and sadly") and with fine-grained paralinguistic controls such as emotion and emphasis. The team is releasing the full benchmark, the data construction toolkit, and an online leaderboard to fuel further research in controllable, multilingual speech synthesis.
- MINT-Bench evaluates TTS models across 10 languages using a 3-axis protocol for content, instruction-following, and quality.
- Open-source models are highly competitive, with some outperforming commercial systems in specific languages like Chinese.
- The benchmark identifies compositional and paralinguistic controls as the biggest remaining bottlenecks for all current AI speech systems.
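The compositional bottleneck can be made concrete with a small sketch: a compositional instruction such as "speak slowly and sadly" decomposes into atomic controls, and under an all-or-nothing criterion the utterance passes only if every control is satisfied. The control names and the pass criterion below are hypothetical illustrations, not MINT-Bench's actual taxonomy or scoring:

```python
# Hypothetical atomic control dimensions; a real taxonomy would be larger.
ATOMIC_CONTROLS = {"rate", "emotion", "emphasis", "pitch"}

def split_compositional(instruction: dict) -> list:
    """Split a compositional instruction into its atomic sub-instructions.

    Example: {"rate": "slow", "emotion": "sad"} -> [("rate", "slow"), ("emotion", "sad")]
    """
    return [(k, v) for k, v in instruction.items() if k in ATOMIC_CONTROLS]

def compositional_success(instruction: dict, per_control_pass: dict) -> bool:
    """All-or-nothing criterion (assumed here): every atomic control must pass.

    This multiplicative structure is one reason compositional accuracy lags:
    a model that satisfies each control 80% of the time in isolation can fall
    well below that once several controls must hold simultaneously.
    """
    return all(per_control_pass[k] for k, _ in split_compositional(instruction))
```

Under this criterion, failing any single sub-instruction fails the whole utterance, which is consistent with the benchmark's finding that compositional controls lag behind single-attribute ones.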
Why It Matters
Provides the first standardized way to compare AI voice models globally, driving competition and exposing key weaknesses in emotional and complex speech generation.