OpenSTBench: New benchmark unifies speech translation evaluation across 6 dimensions
Beyond semantic accuracy, OpenSTBench also measures speech quality, emotion, and latency.
OpenSTBench is a unified evaluation framework for speech translation systems, covering speech-to-text (S2TT) and speech-to-speech (S2ST) translation in offline and streaming settings. It jointly measures translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency. Experiments show that systems with strong translation quality can still differ substantially in speech quality and temporal quality. The code and datasets are available online. The paper has been submitted to EMNLP 2026.
- OpenSTBench unifies evaluation for speech-to-text (S2TT) and speech-to-speech (S2ST) translation in both offline and streaming settings.
- It measures six dimensions: translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency.
- Experiments show that systems with strong translation quality can still differ substantially in speech quality and temporal quality, which underscores the need for multidimensional evaluation.
Why It Matters
OpenSTBench brings S2TT and S2ST systems, offline and streaming, under one evaluation framework, so they can be compared on more than translation accuracy before real-world deployment.