Audio & Speech

WST-X Series: Wavelet Scattering Transform for Interpretable Speech Deepfake Detection

New front-end outperforms SSL and hand-crafted features on Deepfake-Eval-2024

Deep Dive

Speech deepfake detection faces a fundamental trade-off in front-end design. Hand-crafted filterbank features (like MFCCs) are transparent and interpretable but fail to capture higher-level acoustic cues. SSL-based features (e.g., wav2vec 2.0) are more powerful but lack interpretability and often miss fine-grained spectral anomalies that distinguish real from fake speech.

To bridge this gap, Xi Xuan and colleagues propose the WST-X series, a novel family of feature extractors built on the wavelet scattering transform (WST). WST cascades wavelet convolutions with modulus nonlinearities to produce deformation-stable, multi-scale representations. Tested on the Deepfake-Eval-2024 benchmark plus cross-dataset evaluations on SpoofCeleb and In-the-Wild, WST-X outperforms existing front-ends by a wide margin. The authors find that a small averaging scale (J) combined with high frequency resolution (Q) and directional resolution (L) is critical for capturing subtle artifacts. The code is publicly available on GitHub.
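To make the J/Q knobs concrete, here is a minimal first-order scattering sketch in NumPy: band-pass wavelet convolutions, a modulus nonlinearity, then low-pass averaging over a window of 2^J samples. The filter design (Gabor-like bumps, fixed octave count) is illustrative only and is not the authors' WST-X implementation; for real use, a library such as Kymatio provides proper scattering transforms.

```python
import numpy as np

def gabor_bank(n, Q=8, n_octaves=6):
    """Illustrative Gabor-like band-pass filters in the frequency domain,
    Q filters per octave (not the paper's exact filterbank)."""
    freqs = np.fft.fftfreq(n)
    filters = []
    for j in range(n_octaves):
        for q in range(Q):
            xi = 0.4 * 2.0 ** (-(j + q / Q))   # center frequency, geometric spacing
            sigma = xi / (2.0 * Q)             # bandwidth shrinks as Q grows
            filters.append(np.exp(-((freqs - xi) ** 2) / (2 * sigma ** 2)))
    return np.stack(filters)

def scattering1d(x, J=4, Q=8):
    """First-order scattering: |x * psi| low-passed and subsampled by 2**J.
    A small J keeps fine temporal detail; a large Q sharpens frequency resolution."""
    n = len(x)
    X = np.fft.fft(x)
    psi = gabor_bank(n, Q=Q)
    # wavelet convolution via FFT, then the modulus nonlinearity
    U1 = np.abs(np.fft.ifft(X[None, :] * psi))
    # averaging at scale 2**J (moving average), then subsample
    T = 2 ** J
    kernel = np.ones(T) / T
    S1 = np.stack([np.convolve(u, kernel, mode="same")[::T] for u in U1])
    return S1  # shape: (n_filters, n // 2**J)

x = np.random.default_rng(0).standard_normal(4096)
S = scattering1d(x, J=4, Q=8)
print(S.shape)  # (48, 256): 6 octaves * Q=8 filters, 4096 / 2**4 frames
```

Note how the trade-off the authors highlight shows up directly: raising J smooths away the fine temporal structure where spoofing artifacts live, while raising Q adds narrower filters that resolve subtle spectral anomalies.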

Key Points
  • WST-X combines the interpretability of hand-crafted features with the representational power of SSL models
  • Outperforms existing front-ends on three benchmarks: Deepfake-Eval-2024, SpoofCeleb, and In-the-Wild
  • Small averaging scale (J) and high frequency/directional resolutions (Q, L) are critical for detecting subtle spectral artifacts

Why It Matters

Improves deepfake audio detection accuracy and interpretability, crucial for trust and transparency in voice authentication.