Audio & Speech

A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

New benchmark reveals XLS-R, UniSpeech-SAT, and WavLM Large outperform rivals, showing resilience to audio degradation.

Deep Dive

A team of researchers has published a new benchmark called Spoof-SUPERB, designed to systematically evaluate self-supervised learning (SSL) models for audio deepfake detection. Published on arXiv and accepted at ICASSP, the work fills a significant gap: despite its security importance, audio deepfake detection had remained outside established evaluation frameworks like SUPERB. The benchmark provides a standardized, reproducible method to compare the performance of different SSL architectures—including generative, discriminative, and spectrogram-based models—across multiple in-domain and out-of-domain datasets, establishing much-needed baselines for the field.

The study's key finding is that large-scale discriminative SSL models, specifically XLS-R, UniSpeech-SAT, and WavLM Large, consistently outperformed other approaches. The researchers attribute this superiority to factors like multilingual pretraining, speaker-aware learning objectives, and sheer model scale. Furthermore, the analysis tested model robustness under acoustic degradations (like background noise or compression), revealing a critical weakness: generative models degraded sharply in performance, while discriminative models remained resilient. This benchmark directly equips developers and security teams with data-driven insights to select the most reliable AI representations for securing voice-based systems against the growing threat of convincing synthetic audio.

Key Points
  • Spoof-SUPERB benchmark evaluates 20 self-supervised learning (SSL) models for detecting synthetic audio.
  • Large discriminative models XLS-R, UniSpeech-SAT, and WavLM Large topped rankings, benefiting from scale and multilingual training.
  • Generative models showed poor robustness to audio degradation, while top discriminative models maintained performance.
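The robustness analysis applies controlled acoustic degradations such as additive noise. As a minimal sketch of one such corruption (an assumption about the general technique, not the paper's exact pipeline), white noise can be mixed into a waveform at a chosen signal-to-noise ratio:

```python
import math
import random

def add_noise_at_snr(signal, snr_db, seed=0):
    """Mix white Gaussian noise into a waveform at a target SNR in dB.

    `signal` is a list of float samples. The noise is scaled so that
    10 * log10(signal_power / noise_power) equals `snr_db` exactly.
    """
    rng = random.Random(seed)  # fixed seed for reproducible test conditions
    sig_power = sum(s * s for s in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    target_noise_power = sig_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]
```

Re-scoring the same test utterances at progressively lower SNRs is what exposes the gap the benchmark reports: generative models' detection accuracy collapses as SNR drops, while the top discriminative models hold up.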

Why It Matters

Provides essential, data-backed guidance for building defenses against AI voice scams and securing authentication systems.