Audio & Speech

S2S-Arena benchmark tests speech AI on tone, emotion, and prosody

New evaluation reveals big gaps in how speech models handle paralinguistic cues.

Deep Dive

Current speech-to-speech (S2S) models capture the words in a spoken request well but largely ignore paralinguistic cues (prosody, emotion, speaker traits) that are essential for natural, human-like communication. To fill this gap, Feng Jiang and co-authors introduce S2S-Arena, a benchmark accepted at ACL 2026. It features a four-level interaction protocol that systematically probes models under increasing paralinguistic complexity, from simple semantic tasks to expressive commands requiring specific emotional tones or vocal styles. A two-stage data pipeline produced 1,243 speech samples covering over 100 real-world tasks, and the arena-style evaluation framework allows reference-free, pairwise comparisons directly in the speech modality.

Benchmarking 10 state-of-the-art S2S systems across more than 1,000 pairwise comparisons reveals striking performance gaps: even top-tier industrial models struggle when paralinguistic demands rise, while academic systems often fail entirely. The analysis identifies key design factors, such as joint training of semantic and prosodic objectives, that govern expressive instruction following. These findings give developers actionable guidance for building more natural, robust, and human-aligned speech agents. For tech professionals, S2S-Arena underscores that the next frontier in voice AI is not just what you say, but how you say it.
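
To make the arena-style idea concrete, here is a minimal sketch of how a set of pairwise votes could be turned into a ranking using a standard Elo update. This is an illustration only: the summary does not say how S2S-Arena aggregates its comparisons, and the function name, model names, and parameters (k, base) are hypothetical.

```python
from collections import defaultdict


def elo_ratings(comparisons, k=32.0, base=1000.0):
    """Compute Elo ratings from pairwise outcomes.

    comparisons: iterable of (model_a, model_b, winner), where winner is
    "a", "b", or "tie". Illustrative only; not the benchmark's own scoring.
    """
    ratings = defaultdict(lambda: base)
    scores = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in comparisons:
        if winner not in scores:
            raise ValueError(f"unknown outcome: {winner!r}")
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the Elo logistic model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        delta = k * (scores[winner] - expected_a)
        # Zero-sum update: what one model gains, the other loses.
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical votes between two placeholder systems.
    votes = [
        ("system_x", "system_y", "a"),
        ("system_x", "system_y", "tie"),
        ("system_y", "system_x", "b"),
    ]
    for name, rating in sorted(elo_ratings(votes).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Because each vote only says which of two spoken responses a judge preferred, a scheme like this needs no reference answer, which is what makes pairwise evaluation workable directly on audio.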

Key Points
  • S2S-Arena uses a four-level interaction protocol to test paralinguistic complexity, from basic commands to emotion-infused speech generation.
  • The benchmark includes 1,243 speech samples covering 100+ real-world tasks, enabling comprehensive evaluation of both semantic and expressive capabilities.
  • Testing 10 S2S systems over 1,000+ pairwise comparisons reveals substantial performance gaps; industrial models outperform academic ones but still struggle with complex paralinguistic instructions.

Why It Matters

This benchmark shifts the evaluation of voice AI from what is said to how it is said, giving developers a way to measure progress toward virtual assistants and voice interfaces that sound natural and respond to emotion.