Audio & Speech

WST-X Series: Wavelet Scattering Transform for Interpretable Speech Deepfake Detection

New front-end outperforms SSL and hand-crafted features on Deepfake-Eval-2024

Deep Dive

Speech deepfake detection faces a fundamental trade-off in front-end design. Hand-crafted filterbank features (like MFCCs) are transparent and interpretable but fail to capture higher-level acoustic cues. SSL-based features (e.g., wav2vec 2.0) are more powerful but lack interpretability and often miss fine-grained spectral anomalies that distinguish real from fake speech.

To bridge this gap, Xi Xuan and colleagues propose the WST-X series, a novel family of feature extractors built on the wavelet scattering transform (WST). WST cascades wavelet convolutions with modulus nonlinearities to produce deformation-stable, multi-scale representations. Tested on the Deepfake-Eval-2024 benchmark plus cross-dataset evaluations on SpoofCeleb and In-the-Wild, WST-X outperforms existing front-ends by a wide margin. The authors find that a small averaging scale (J) combined with high frequency resolution (Q) and directional resolution (L) is critical for capturing subtle artifacts. The code is publicly available on GitHub.
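To make the J/Q knobs concrete, here is a minimal first-order scattering sketch in NumPy: band-pass wavelet convolutions, a modulus nonlinearity, then low-pass averaging over a window of 2^J samples. The filter design (Gabor-like bumps, fixed octave count) is illustrative only and is not the authors' WST-X implementation; for real use, a library such as Kymatio provides proper scattering transforms.

```python
import numpy as np

def gabor_bank(n, Q=8, n_octaves=6):
    """Illustrative Gabor-like band-pass filters in the frequency domain,
    Q filters per octave (not the paper's exact filterbank)."""
    freqs = np.fft.fftfreq(n)
    filters = []
    for j in range(n_octaves):
        for q in range(Q):
            xi = 0.4 * 2.0 ** (-(j + q / Q))   # center frequency, geometric spacing
            sigma = xi / (2.0 * Q)             # bandwidth shrinks as Q grows
            filters.append(np.exp(-((freqs - xi) ** 2) / (2 * sigma ** 2)))
    return np.stack(filters)

def scattering1d(x, J=4, Q=8):
    """First-order scattering: |x * psi| low-passed and subsampled by 2**J.
    A small J keeps fine temporal detail; a large Q sharpens frequency resolution."""
    n = len(x)
    X = np.fft.fft(x)
    psi = gabor_bank(n, Q=Q)
    # wavelet convolution via FFT, then the modulus nonlinearity
    U1 = np.abs(np.fft.ifft(X[None, :] * psi))
    # averaging at scale 2**J (moving average), then subsample
    T = 2 ** J
    kernel = np.ones(T) / T
    S1 = np.stack([np.convolve(u, kernel, mode="same")[::T] for u in U1])
    return S1  # shape: (n_filters, n // 2**J)

x = np.random.default_rng(0).standard_normal(4096)
S = scattering1d(x, J=4, Q=8)
print(S.shape)  # (48, 256): 6 octaves * Q=8 filters, 4096 / 2**4 frames
```

Note how the trade-off the authors highlight shows up directly: raising J smooths away the fine temporal structure where spoofing artifacts live, while raising Q adds narrower filters that resolve subtle spectral anomalies.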

Key Points
  • WST-X combines the interpretability of hand-crafted features with the representational power of SSL models
  • Outperforms existing front-ends on three benchmarks: Deepfake-Eval-2024, SpoofCeleb, and In-the-Wild
  • Small averaging scale (J) and high frequency/directional resolutions (Q, L) are critical for detecting subtle spectral artifacts

Why It Matters

Improves deepfake audio detection accuracy and interpretability, crucial for trust and transparency in voice authentication.