Audio & Speech

OpenSTBench: New benchmark unifies speech translation evaluation across 6 dimensions

Beyond semantic accuracy, OpenSTBench also measures speech quality, emotion, and latency.

Deep Dive

OpenSTBench is a unified evaluation framework for speech translation systems, covering speech-to-text (S2TT) and speech-to-speech (S2ST) translation in both offline and streaming settings. It jointly measures six dimensions: translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency. Experiments show that systems with strong translation quality can still differ substantially in speech quality and temporal quality. The code and datasets are available online. The paper has been submitted to EMNLP 2026.

Key Points
  • OpenSTBench unifies evaluation for both speech-to-text translation (S2TT) and speech-to-speech translation (S2ST) in offline and streaming modes.
  • It measures 6 dimensions: translation quality, speech quality, speaker preservation, emotion/paralinguistic fidelity, temporal consistency, and latency.
  • Experiments show that systems with similarly strong translation quality can still differ substantially in speech quality and temporal quality, which underscores the need for multidimensional evaluation.
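
To make the last point concrete, here is a minimal Python sketch of how a multidimensional comparison might surface such differences. It is not OpenSTBench's actual API: the `Scores` record, the `divergent_dimensions` helper, the 0 to 1 "higher is better" scale, the thresholds, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields


# Hypothetical record: field names mirror the six dimensions listed above.
# The 0-1 "higher is better" scale is an assumption, not the benchmark's.
@dataclass(frozen=True)
class Scores:
    translation_quality: float
    speech_quality: float
    speaker_preservation: float
    emotion_fidelity: float
    temporal_consistency: float
    latency: float  # assumed normalised so that higher means lower delay


def divergent_dimensions(a, b, tie=0.02, gap=0.10):
    """Return the dimensions where a and b differ by more than `gap`,
    but only if their translation quality is within `tie` of each other.
    Otherwise return an empty list."""
    if abs(a.translation_quality - b.translation_quality) > tie:
        return []
    return [
        f.name
        for f in fields(Scores)
        if f.name != "translation_quality"
        and abs(getattr(a, f.name) - getattr(b, f.name)) > gap
    ]


# Made-up numbers for illustration, not results from the paper.
system_a = Scores(0.81, 0.90, 0.75, 0.70, 0.88, 0.60)
system_b = Scores(0.80, 0.62, 0.74, 0.68, 0.55, 0.58)

print(divergent_dimensions(system_a, system_b))
# prints ['speech_quality', 'temporal_consistency']
```

A translation-only leaderboard would rank these two invented systems as near-equal, while the per-dimension view flags where they part ways.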

Why It Matters

OpenSTBench lets practitioners compare speech translation systems across all six dimensions at once. That matters for real-world deployment, where a system with accurate translations may still fall short on speech quality, timing, or latency.