Audio & Speech

Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection

New study tunes contrastive learning to catch fake voices, reaching a 4.44% pooled equal error rate

Deep Dive

Researchers Jaskirat Sudan, Hashim Ali, Surya Subramani, and Hafiz Malik from the University of Michigan-Dearborn published a controlled study on supervised contrastive learning (SupCon) for audio deepfake detection, a domain where SupCon's specific impact was previously underexplored. Using the wav2vec2 XLS-R (300M) model, they systematically varied two key SupCon components: the similarity measure (cosine vs. angular similarity, the latter derived from the hyperspherical angle) and negative scaling via a warm-started global cross-batch queue. The two-stage pipeline first fine-tuned the encoder and projection head with SupCon, then froze them to train a linear classifier with binary cross-entropy loss.
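To make the two varied components concrete, here is a minimal numpy sketch of a SupCon-style loss with a switch between cosine and angular similarity. This is an illustrative reconstruction, not the authors' implementation: the function name, temperature default, and toy inputs are assumptions, and the cross-batch queue is omitted.

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.07, use_angular=False):
    """Supervised contrastive (SupCon) loss over a batch of embeddings.

    use_angular=False scores pairs by cosine similarity; use_angular=True
    replaces it with the negated hyperspherical angle, so a smaller angle
    still yields a larger similarity. Illustrative sketch only.
    """
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T
    if use_angular:
        sim = -np.arccos(np.clip(sim, -1.0, 1.0))  # angle-based similarity
    self_mask = np.eye(len(z), dtype=bool)
    logits = np.where(self_mask, -np.inf, sim / temperature)
    # log-softmax over every other sample in the batch
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives: same label, excluding the anchor itself
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_counts = pos.sum(axis=1)
    valid = pos_counts > 0  # anchors with at least one positive
    per_anchor = -np.where(pos, log_prob, 0.0)[valid].sum(axis=1) / pos_counts[valid]
    return per_anchor.mean()
```

In the paper's setting, negatives would additionally be drawn from the warm-started global queue rather than only the current batch, which is exactly the "negative scaling" knob the study varies.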

Trained on the ASVspoof 2019 LA dataset and evaluated across the ASVspoof 2019 eval, In-the-Wild (ITW), and ASVspoof 2021 DF/LA benchmarks, Cosine SupCon with a delayed queue achieved the best results: ITW equal error rate (EER) of 8.29% and pooled EER of 4.44%. Interestingly, angular similarity performed strongly without queued negatives (ITW 8.70%), suggesting reduced reliance on large negative sets, a finding that could simplify training pipelines. This work provides clear guidance for practitioners building robust deepfake audio detectors, emphasizing the importance of similarity choice and negative sampling strategy in contrastive learning.
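All of the reported numbers are equal error rates: the operating point where the false-acceptance rate (spoof accepted) equals the false-rejection rate (bona fide rejected). A simple threshold-sweep sketch, with assumed score/label conventions (higher score = more likely bona fide):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: threshold where false-accept and false-reject rates cross.

    scores: detector outputs, higher = more likely bona fide.
    labels: 1 for bona fide, 0 for spoofed audio. Illustrative sketch.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    spoof = scores[labels == 0]
    bona = scores[labels == 1]
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):          # sweep candidate thresholds
        far = np.mean(spoof >= t)        # spoof wrongly accepted
        frr = np.mean(bona < t)          # bona fide wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

Production evaluations typically interpolate between thresholds for a smoother estimate, but the crossing-point idea is the same: a pooled EER of 4.44% means both error types sit at roughly 4.44% at that threshold.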

Key Points
  • Cosine SupCon with delayed queue achieved best ITW EER (8.29%) and pooled EER (4.44%) across multiple benchmarks
  • Angular similarity performed well without large negative sets (ITW 8.70%), reducing computational overhead
  • Two-stage pipeline: SupCon fine-tuning of wav2vec2 XLS-R (300M) encoder, then frozen linear classifier with BCE
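The second stage of the pipeline amounts to logistic regression on frozen embeddings: a single linear layer trained with binary cross-entropy while the SupCon-tuned encoder stays fixed. A minimal numpy sketch under that assumption (hyperparameters and toy data are mine, not the paper's):

```python
import numpy as np

def train_linear_bce(embeddings, labels, lr=0.5, epochs=200):
    """Stage 2 sketch: frozen embeddings -> linear classifier via BCE.

    embeddings: (N, D) outputs of the frozen encoder.
    labels: 1.0 bona fide, 0.0 spoof. Plain gradient descent.
    """
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=embeddings.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(embeddings @ w + b)))  # sigmoid
        grad = p - labels                 # dBCE/dlogit for each sample
        w -= lr * embeddings.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b
```

Freezing the encoder keeps the evaluation honest: any gain over a baseline reflects the quality of the contrastively learned representation, not extra supervised fine-tuning of the 300M-parameter backbone.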

Why It Matters

Optimized contrastive learning cuts deepfake audio detection errors, critical for voice authentication and anti-fraud systems.