A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection
New benchmark finds that XLS-R, UniSpeech-SAT, and WavLM Large outperform rivals and stay resilient under audio degradation.
A team of researchers has published Spoof-SUPERB, a new benchmark designed to systematically evaluate self-supervised learning (SSL) models on audio deepfake detection. Published on arXiv and accepted at ICASSP, the work fills a gap: despite its security importance, audio deepfake detection had remained outside established evaluation frameworks such as SUPERB. The benchmark provides a standardized, reproducible way to compare SSL architectures (generative, discriminative, and spectrogram-based models) across multiple in-domain and out-of-domain datasets, establishing baselines for the field.
The study's key finding is that large-scale discriminative SSL models, specifically XLS-R, UniSpeech-SAT, and WavLM Large, consistently outperformed other approaches. The researchers attribute this advantage to multilingual pretraining, speaker-aware learning objectives, and model scale. The analysis also tested robustness under acoustic degradations such as background noise and compression, and found a clear split: generative models degraded sharply, while discriminative models remained resilient. For developers and security teams, the benchmark offers data-driven guidance on which SSL representations to choose when securing voice-based systems against increasingly convincing synthetic audio.
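To make the idea of an acoustic degradation test concrete, here is a minimal sketch of one common kind of degradation: mixing white Gaussian noise into a clean signal at a chosen signal-to-noise ratio (SNR). This is an illustration only, not the benchmark's actual pipeline; the function names, the synthetic 440 Hz tone, the 16 kHz sample rate, and the SNR levels are all assumptions made for the example.

```python
import math
import random


def mean_power(samples):
    """Average power (mean of squared samples) of a signal."""
    return sum(s * s for s in samples) / len(samples)


def add_noise_at_snr(clean, snr_db, seed=0):
    """Return `clean` plus white Gaussian noise scaled to the requested SNR in dB.

    Hypothetical helper for illustration; not taken from Spoof-SUPERB.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    target_noise_power = mean_power(clean) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [c + scale * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, degraded):
    """SNR in dB of `degraded`, treating (degraded - clean) as the noise."""
    residual = [d - c for c, d in zip(clean, degraded)]
    return 10.0 * math.log10(mean_power(clean) / mean_power(residual))


# Stand-in for a real utterance: one second of a 440 Hz tone at 16 kHz (assumed values).
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]

for snr in (20, 10, 0):
    noisy = add_noise_at_snr(tone, snr)
    print(f"target {snr} dB, measured {measured_snr_db(tone, noisy):.3f} dB")
```

In a robustness study, the same detector would be scored on the clean audio and on each degraded copy, and the drop in accuracy compared across models.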
- Spoof-SUPERB benchmark evaluates 20 self-supervised learning (SSL) models for detecting synthetic audio.
- The large discriminative models XLS-R, UniSpeech-SAT, and WavLM Large topped the rankings, benefiting from scale and multilingual training.
- Generative models showed poor robustness to audio degradation, while top discriminative models maintained performance.
Why It Matters
The benchmark provides data-backed guidance for building defenses against AI voice scams and for securing voice authentication systems.