Audio & Speech

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

New benchmark reveals speech models fail to act on vocal cues like tone and speaker identity, creating hidden risks.

Deep Dive

A research team of 12 authors led by Yuxiang Wang has introduced VoxSafeBench, a pioneering benchmark designed to evaluate the social alignment of Speech Language Models (SLMs) as they move into shared, multi-user environments. The core insight is that a model's response must consider more than just the words spoken; it must also account for who is speaking, their vocal tone, and the surrounding environment. A request that is benign in text can become unsafe, unfair, or a privacy violation when these acoustic factors are present: "read my messages aloud", for instance, is harmless in private but becomes a privacy risk when the audio reveals that others are in the room. Existing benchmarks fail to capture this complexity, focusing either on basic audio comprehension or on single risks studied in isolation.

VoxSafeBench employs a Two-Tier evaluation framework. Tier 1 assesses content-centric risks using matched text and audio inputs. Tier 2 is the novel component, targeting 'audio-conditioned risks': the transcript is harmless, but the correct response depends on paralinguistic cues. The benchmark spans 22 tasks with bilingual coverage and includes intermediate perception probes to confirm that models can detect the cues at all. The results expose a critical 'speech grounding gap': frontier SLMs recognize the relevant social norms when they are stated in text but consistently fail to apply them when the decisive cue, such as a speaker's distress, demographic identity, or a private location, is conveyed solely through speech. This leads to significant drops in safety awareness, erosion of fairness, and faltering privacy protections.
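To make the Tier 2 setup concrete, here is a minimal Python sketch of how an audio-conditioned test case and the grounding-gap comparison might be structured. Every name in it (Tier2Case, respond_text, respond_audio, the judge callable) is a hypothetical illustration, not the benchmark's actual code or API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Tier2Case:
    """One audio-conditioned test case: the transcript is harmless as
    plain text; the correct response hinges on a paralinguistic cue."""
    transcript: str         # what is said (benign on its own)
    audio_path: str         # the same utterance as recorded speech
    cue: str                # decisive cue, e.g. "distressed tone", "child speaker"
    expected_behavior: str  # e.g. "de-escalate", "refuse", "protect privacy"


def grounding_gap(model, cases: Sequence[Tier2Case],
                  judge: Callable[[str, str], bool]) -> float:
    """Score each case twice: once with the cue spelled out in text and
    once with the cue carried only by the audio. The difference between
    the two pass rates is the speech grounding gap."""
    text_pass = audio_pass = 0
    for case in cases:
        # Text condition: the cue is made explicit in the prompt, so a
        # model that knows the norm should be able to apply it.
        text_prompt = f"[Speaker cue: {case.cue}] {case.transcript}"
        if judge(model.respond_text(text_prompt), case.expected_behavior):
            text_pass += 1
        # Audio condition: the same request, with the cue conveyed
        # solely through the speech signal.
        if judge(model.respond_audio(case.audio_path), case.expected_behavior):
            audio_pass += 1
    n = len(cases)
    # A large positive gap means the model recognizes the norm in text
    # but fails to ground it when the cue is acoustic.
    return text_pass / n - audio_pass / n
```

In this sketch, an intermediate perception probe (asking the model to describe the speaker or setting before scoring its response) would sit between the two conditions, mirroring the paper's probes: if a model can describe the cue but still responds unsafely, the failure is in grounding, not perception.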

The findings challenge the assumption that safeguards robust in text will translate to speech. They highlight a major blind spot for developers of voice assistants, customer service bots, and ambient AI, where context is everything. The publicly released code and data provide a crucial tool for developers to stress-test their models against these real-world, multi-dimensional social risks, pushing the field toward SLMs that are truly context-aware.

Key Points
  • Two-Tier benchmark evaluates SLMs on safety, fairness, and privacy across 22 bilingual tasks, focusing on audio-conditioned risks.
  • Reveals a 'speech grounding gap': models can detect acoustic cues (tone, speaker identity) yet fail to act on them, causing significant drops in safety, fairness, and privacy performance.
  • Publicly available code and data provide a tool to test real-world risks in voice AI, crucial for assistants and multi-user environments.

Why It Matters

Exposes hidden vulnerabilities in voice AI used in homes and customer service, where context from tone and environment is critical for safe, fair interactions.