TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
A new framework reveals hidden flaws in TTS models using adversarial perturbations...
Researchers from multiple institutions have unveiled TTS-PRISM, a diagnostic framework for Mandarin text-to-speech (TTS) systems. Described in a paper posted to arXiv and submitted to Interspeech 2026, the framework addresses a critical gap: modern generative TTS models achieve near-human quality, yet monolithic metrics like MOS (Mean Opinion Score) collapse everything into a single number, so they fail to detect fine-grained acoustic artifacts or to explain perceptual collapse. TTS-PRISM introduces a 12-dimensional evaluation schema ranging from basic stability to advanced expressiveness, so that each flaw can be attributed to a specific dimension. The team built a high-quality diagnostic dataset through a targeted synthesis pipeline that injects adversarial perturbations and uses expert anchors to create challenging test cases.
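To make the idea of a targeted synthesis pipeline concrete, here is a minimal sketch of how a labeled perturbation might be injected into a clean sample. Everything in it is an assumption for illustration: the `DiagnosticSample` record, the `inject_dropout` helper, the dimension name `temporal_stability`, and the choice of a silent gap as the perturbation are hypothetical and are not taken from the TTS-PRISM paper or code.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class DiagnosticSample:
    waveform: List[float]    # mono samples in [-1, 1]
    target_dimension: str    # hypothetical schema dimension the flaw targets
    is_perturbed: bool


def make_tone(num_samples: int, freq_hz: float = 220.0, sample_rate: int = 16000) -> List[float]:
    """Stand-in for a clean TTS output: a plain sine tone."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(num_samples)]


def inject_dropout(waveform: List[float], start: int, length: int) -> DiagnosticSample:
    """Zero out a short span to imitate an audible glitch (illustrative only)."""
    end = min(start + length, len(waveform))
    perturbed = list(waveform)                    # copy, so the clean input is untouched
    perturbed[start:end] = [0.0] * (end - start)  # same length in, same length out
    return DiagnosticSample(perturbed, "temporal_stability", True)


clean = make_tone(1600)                               # 0.1 s at 16 kHz
sample = inject_dropout(clean, start=400, length=160)  # 10 ms silent gap
```

Pairing each perturbed waveform with the dimension it is meant to stress is what would let an evaluator be checked on whether it flags the right flaw, rather than only a lower overall score.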
TTS-PRISM employs schema-driven instruction tuning to embed explicit scoring criteria and reasoning into an efficient end-to-end model. On a 1,600-sample Gold Test Set, it outperformed generalist models in human alignment, showing a stronger ability to detect subtle issues such as unnatural prosody or glitches. The researchers also used the framework to profile six distinct TTS paradigms, generating intuitive diagnostic flags that reveal nuanced capability differences, such as which models struggle with emotional expressiveness and which with temporal stability. TTS-PRISM is fully open-source, with code and checkpoints available on GitHub, allowing researchers to replicate and extend the work. The tool promises to accelerate TTS quality assurance by replacing vague metrics with actionable, interpretable diagnostics.
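As a rough illustration of what "human alignment" and "diagnostic flags" can mean in practice, the sketch below scores agreement between model and human ratings with Spearman rank correlation and flags low-scoring dimensions. These are assumptions, not the paper's method: the source does not say which agreement statistic is reported, and the 1-to-5 scale, the threshold of 3.0, the dimension names, and all numbers are made-up toy values.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def diagnostic_flags(scores, threshold=3.0):
    """Return the dimensions whose score falls below the threshold."""
    return sorted(dim for dim, s in scores.items() if s < threshold)


# Toy ratings for one dimension across five utterances (not real data).
human_scores = [1, 2, 3, 4, 5]
model_scores = [1.2, 1.9, 3.5, 3.4, 4.8]
alignment = spearman(human_scores, model_scores)

# Toy per-dimension profile for one TTS system (hypothetical names).
flags = diagnostic_flags({"prosody": 2.5, "stability": 4.1, "emotion": 2.9})
```

Computing the correlation separately per dimension, rather than on one pooled score, is what would distinguish this kind of evaluation from a single MOS-style number.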
- TTS-PRISM uses a 12-dimensional schema to evaluate Mandarin TTS, ranging from basic stability to advanced expressiveness.
- Tested on a 1,600-sample Gold Test Set, it outperforms generalist models in human alignment.
- Open-source release includes code and checkpoints on GitHub for reproducibility.
Why It Matters
TTS-PRISM enables precise, interpretable diagnostics for AI speech, replacing vague metrics with actionable insights.