Audio & Speech

A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

New benchmark reveals XLS-R, UniSpeech-SAT, and WavLM Large outperform rivals, showing resilience to audio degradation.

Deep Dive

A team of researchers has published a new benchmark called Spoof-SUPERB, designed to systematically evaluate self-supervised learning (SSL) models for audio deepfake detection. Published on arXiv and accepted at ICASSP, the work fills a significant gap: despite its security importance, audio deepfake detection had remained outside established evaluation frameworks like SUPERB. The benchmark provides a standardized, reproducible method to compare the performance of different SSL architectures—including generative, discriminative, and spectrogram-based models—across multiple in-domain and out-of-domain datasets, establishing much-needed baselines for the field.

The study's key finding is that large-scale discriminative SSL models, specifically XLS-R, UniSpeech-SAT, and WavLM Large, consistently outperformed other approaches. The researchers attribute this superiority to factors like multilingual pretraining, speaker-aware learning objectives, and sheer model scale. Furthermore, the analysis tested model robustness under acoustic degradations (like background noise or compression), revealing a critical weakness: generative models degraded sharply in performance, while discriminative models remained resilient. This benchmark directly equips developers and security teams with data-driven insights to select the most reliable AI representations for securing voice-based systems against the growing threat of convincing synthetic audio.

Key Points
  • Spoof-SUPERB benchmark evaluates 20 self-supervised learning (SSL) models for detecting synthetic audio.
  • Large discriminative models XLS-R, UniSpeech-SAT, and WavLM Large topped rankings, benefiting from scale and multilingual training.
  • Generative models showed poor robustness to audio degradation, while top discriminative models maintained performance.
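The robustness analysis applies controlled acoustic degradations such as additive noise. As a minimal sketch of one such corruption (an assumption about the general technique, not the paper's exact pipeline), white noise can be mixed into a waveform at a chosen signal-to-noise ratio:

```python
import math
import random

def add_noise_at_snr(signal, snr_db, seed=0):
    """Mix white Gaussian noise into a waveform at a target SNR in dB.

    `signal` is a list of float samples. The noise is scaled so that
    10 * log10(signal_power / noise_power) equals `snr_db` exactly.
    """
    rng = random.Random(seed)  # fixed seed for reproducible test conditions
    sig_power = sum(s * s for s in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    target_noise_power = sig_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]
```

Re-scoring the same test utterances at progressively lower SNRs is what exposes the gap the benchmark reports: generative models' detection accuracy collapses as SNR drops, while the top discriminative models hold up.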

Why It Matters

Provides essential, data-backed guidance for building defenses against AI voice scams and securing authentication systems.