Audio & Speech

DASB -- Discrete Audio and Speech Benchmark

A new comprehensive benchmark shows that discrete audio tokens are less robust than continuous features and require careful tuning to preserve speaker identity and phonetic content.

Deep Dive

A research team led by Pooneh Mousavi has released the Discrete Audio and Speech Benchmark (DASB), a comprehensive framework designed to evaluate and compare discrete audio tokens. These tokens, discrete symbols produced by quantizing audio (analogous to the words of a text), are crucial for building multimodal language models that can process both text and audio. The benchmark addresses a major problem in the field: inconsistent evaluation settings across studies make it difficult to identify the best tokenizers and configurations for preserving essential audio information such as speaker identity, phonetic content, and paralinguistic cues.
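The quantization step behind discrete tokens can be illustrated with a minimal sketch: continuous per-frame features are mapped to the index of their nearest entry in a learned codebook, turning audio into a sequence of integer "words". This is a generic nearest-centroid sketch, not DASB's or any specific tokenizer's implementation; the feature and codebook values below are random stand-ins.

```python
import numpy as np

def tokenize(frames, codebook):
    """Map each continuous feature frame to the id of its nearest codebook entry.

    frames:   (T, D) array of continuous features (e.g. from an SSL encoder)
    codebook: (K, D) array of learned centroids
    returns:  (T,) array of integer token ids in [0, K)
    """
    # Pairwise distances between every frame and every codebook entry: (T, K)
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Toy data standing in for real encoder features and a trained codebook.
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 16))     # 50 frames of 16-dim features
codebook = rng.normal(size=(256, 16))  # 256 discrete entries
tokens = tokenize(frames, codebook)    # sequence of 50 token ids
```

The token sequence can then be consumed by a language-model-style architecture exactly like text tokens, which is what makes this representation attractive for multimodal models, and the nearest-centroid assignment is where information (and robustness) can be lost.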

DASB tests discrete representations across three domains—speech, general audio, and music—on a range of discriminative and generative tasks. The team's results reveal that discrete audio tokens are currently less robust than continuous representations (like raw waveforms or spectrograms) and require meticulous tuning of factors including model architecture, dataset size, learning rate, and model capacity. While semantic tokens (which capture higher-level meaning) generally outperform purely acoustic tokens, a significant performance gap persists between the best discrete tokens and continuous features.

The benchmark's public release, including its code, evaluation setup, and leaderboards, provides a much-needed standardized toolkit for the research community. This will accelerate progress by enabling fair comparisons and highlighting specific areas where discrete tokenization falls short. The findings underscore that turning audio into a "language" for AI models is harder than it is for text, and that closing the gap with continuous representations is a key hurdle for creating truly fluent audio-understanding and generation models.
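The value of a shared evaluation setup can be sketched in a few lines: every tokenizer is scored on every task under one identical protocol, so the resulting numbers are directly comparable. The harness below is a toy illustration of that idea, not DASB's actual code; the tokenizers and tasks are hypothetical, with a deliberately lossy tokenizer included to show how discretization loss surfaces as a task-dependent score gap.

```python
# Toy benchmark harness: score each tokenizer on each task under one protocol.
# Here the "task" is simply round-trip fidelity on a list of integers.

def evaluate(tokenizer, task_data):
    """Fraction of items the tokenizer encodes and decodes back unchanged."""
    ok = sum(tokenizer["decode"](tokenizer["encode"](x)) == x for x in task_data)
    return ok / len(task_data)

# Two hypothetical tokenizers: one lossless, one that discards the low bit.
identity_tok = {"encode": lambda x: x, "decode": lambda x: x}
lossy_tok = {"encode": lambda x: x // 2, "decode": lambda x: x * 2}

tasks = {"even_numbers": [0, 2, 4, 6], "all_numbers": [0, 1, 2, 3]}
leaderboard = {
    (tok_name, task_name): evaluate(tok, data)
    for tok_name, tok in {"identity": identity_tok, "lossy": lossy_tok}.items()
    for task_name, data in tasks.items()
}
# The lossy tokenizer looks perfect on even numbers but degrades on the full
# range: the same discretization can be harmless or harmful depending on task.
```

Holding the protocol fixed is what makes the resulting leaderboard meaningful: any score difference reflects the tokenizer, not the evaluation setup, which is precisely the inconsistency DASB is designed to remove.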

Key Points
  • DASB is the first comprehensive benchmark for discrete audio tokens across speech, audio, and music domains.
  • Results show discrete tokens are less robust than continuous features and require careful hyperparameter tuning.
  • Semantic tokens outperform acoustic tokens, but a significant performance gap to continuous representations remains.

Why It Matters

Provides a standardized framework to accelerate development of AI that can truly understand and generate audio, a core challenge for multimodal assistants.