Audio & Speech

SwanBench-Speech: New benchmark exposes speech AI's expressive failures

1,101 samples across 17 speech scenarios reveal models still can't match human consistency

Deep Dive

A team of 15 researchers led by Changhao Pan has released SwanBench-Speech, a benchmark for evaluating long-form speech generation systems. Unlike existing benchmarks that focus on short utterances or narrow domains, SwanBench-Speech targets the growing need for high-fidelity speech in extended contexts such as audiobooks, dialogues, and presentations. The benchmark comprises 1,101 samples spanning 17 common speech scenarios and covers three dimensions: acoustic quality, semantic coherence, and expressive delivery. To assess these dimensions, the team defines seven automated metrics that go beyond traditional word error rate (WER) and mean opinion score (MOS), explicitly measuring consistency and hierarchical structure in long-form outputs.

The benchmark's experiments, which cover 11 state-of-the-art models, reveal significant shortcomings. While short-phrase synthesis has reached impressive fidelity, long-context generation still degrades noticeably: models struggle to keep a character's voice consistent across paragraphs, and they fail to convey proper prosodic hierarchy (e.g., distinguishing main clauses from subordinate ones). Expressive scenarios, such as emotional narration or conversational turns, show the largest gap relative to real human recordings. SwanBench-Speech provides a standardized evaluation protocol that can guide future research toward more robust, context-aware speech generation. The paper has been accepted to ACL 2026 Findings and includes 14 figures detailing performance across diverse conditions.

Key Points
  • SwanBench-Speech evaluates 11 state-of-the-art speech models across 17 scenarios with 1,101 samples
  • Seven new metrics measure acoustic quality, semantic coherence, and expressive delivery in long-form speech
  • In expressive tasks, current models score 30–40% lower than real recordings on consistency and hierarchy metrics

Why It Matters

As speech AI moves from short commands to audiobooks and virtual assistants, this benchmark gives researchers a standardized way to measure whether long-form output stays consistent and expressive.