AFSS: Artifact-Focused Self-Synthesis for Mitigating Bias in Audio Deepfake Detection
A new self-synthesis technique eliminates the need for pre-collected fake audio datasets, forcing detectors to focus on generation artifacts rather than dataset-specific cues.
A research team has introduced AFSS (Artifact-Focused Self-Synthesis), a groundbreaking method designed to solve a critical flaw in audio deepfake detection: dataset bias. Current detectors often learn to recognize the specific characteristics of the fake audio they were trained on, rather than the universal artifacts left by generative models, causing them to fail on new, unseen data. AFSS tackles this by synthetically creating its own training data. It takes real audio and generates 'pseudo-fake' samples through two core mechanisms—self-conversion and self-reconstruction—while strictly enforcing that the real and synthetic samples share the same speaker identity and semantic content. This forces the AI model to ignore irrelevant factors like voice or accent and focus solely on the subtle technical artifacts that betray a synthetic audio file.
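The summary above does not specify how self-reconstruction is implemented (the actual method would pass real audio through a generative model such as a neural codec or vocoder). As a purely illustrative stand-in, the sketch below uses a lossy mu-law quantize/dequantize roundtrip: the output keeps the same speaker and content as the input, differing only in the artifacts the resynthesis step introduces, which is the property AFSS exploits. All function names here are hypothetical.

```python
import numpy as np

def self_reconstruct(audio: np.ndarray, mu: int = 255) -> np.ndarray:
    """Stand-in for AFSS-style self-reconstruction: a lossy
    analysis/synthesis roundtrip that preserves speaker identity and
    semantic content but leaves small resynthesis artifacts.
    (The real method would use a generative model, not mu-law coding.)"""
    # mu-law companding, 8-bit quantisation (the lossy step), then expansion
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    quantised = np.round((compressed + 1) / 2 * mu)
    dequantised = quantised / mu * 2 - 1
    return np.sign(dequantised) * np.expm1(np.abs(dequantised) * np.log1p(mu)) / mu

def make_training_pair(real_audio: np.ndarray):
    """Build a (real, pseudo-fake) pair sharing speaker and content;
    only the synthesis artifacts distinguish the two samples."""
    pseudo_fake = self_reconstruct(real_audio)
    return (real_audio, 0), (pseudo_fake, 1)   # label 1 = fake

# Example: one second of noise standing in for a real utterance
rng = np.random.default_rng(0)
real = np.clip(rng.normal(0.0, 0.2, 16000), -1.0, 1.0)
(real_x, real_y), (fake_x, fake_y) = make_training_pair(real)
```

Because the pair is identical in everything except the resynthesis step, a detector trained on such pairs has nothing to latch onto except the artifacts themselves.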
Extensive testing across seven challenging datasets, including WaveFake and an 'In-the-Wild' set, demonstrates AFSS's superior generalization. The system achieved a state-of-the-art average Equal Error Rate (EER) of just 5.45%, with remarkably low scores of 1.23% on WaveFake and 2.70% on In-the-Wild data. A key innovation is that AFSS eliminates the dependency on pre-collected fake datasets, which are often limited and biased; instead, it uses a learnable reweighting loss to dynamically emphasize the most informative synthetic samples during training. This self-contained approach not only sets a new performance benchmark but also provides a more robust and scalable framework for building detectors that can keep pace with rapidly evolving generative audio models.
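The exact form of the learnable reweighting loss is not given in this summary. One common formulation (an assumption here, not taken from the paper) attaches a learnable logit to each synthetic sample and softmax-normalises those logits into weights on a standard binary cross-entropy, so that training can shift emphasis toward the most informative pseudo-fakes:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - np.max(z)           # stabilise against overflow
    e = np.exp(z)
    return e / e.sum()

def reweighted_bce(scores, labels, weight_logits) -> float:
    """Weighted binary cross-entropy. `weight_logits` stand in for
    learnable parameters updated jointly with the detector; softmax
    keeps the weights positive and summing to one.
    (Illustrative formulation, assumed rather than sourced.)"""
    w = softmax(np.asarray(weight_logits, dtype=float))
    p = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float)))  # sigmoid
    bce = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return float(np.sum(w * bce))

scores = np.array([2.0, -1.0, 0.5, -0.5])   # detector logits
labels = np.array([1.0, 0.0, 1.0, 0.0])     # 1 = pseudo-fake
uniform = reweighted_bce(scores, labels, np.zeros(4))          # equal weights
skewed = reweighted_bce(scores, labels, np.array([0.0, 0.0, 3.0, 0.0]))
```

With all weight logits equal, the loss reduces to the ordinary mean BCE; raising one logit concentrates the loss on that sample, which is the mechanism a learnable reweighting scheme exploits.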
- Generates its own 'pseudo-fake' training data from real audio via self-conversion/reconstruction, removing need for pre-collected fake datasets.
- Enforces same-speaker constraints to force the detector to focus exclusively on generation artifacts, not speaker-specific traits, mitigating bias.
- Achieved state-of-the-art 5.45% average EER across 7 datasets, including 1.23% on WaveFake, demonstrating superior generalization.
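For readers unfamiliar with the metric in the bullets above: the Equal Error Rate is the operating point at which the false-acceptance rate (real audio flagged as fake) equals the false-rejection rate (fakes missed). A minimal sketch of computing it from detector scores (illustrative only, not the evaluation code used in the study):

```python
import numpy as np

def equal_error_rate(scores, labels) -> float:
    """EER: threshold where false-acceptance and false-rejection rates
    coincide. `scores` are fake-likelihoods; labels use 1 = fake, 0 = real."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    fakes, reals = scores[labels == 1], scores[labels == 0]
    # candidate thresholds: every distinct score, plus one above the max
    thresholds = np.concatenate([np.unique(scores), [np.inf]])
    best_diff, best_eer = np.inf, 1.0
    for t in thresholds:
        far = float(np.mean(reals >= t))   # real accepted as fake
        frr = float(np.mean(fakes < t))    # fake rejected as real
        if abs(far - frr) < best_diff:
            best_diff, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

# Perfectly separable scores yield an EER of 0
separable = equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
```

A lower EER means the detector separates real from fake audio more cleanly at its best single threshold, which is why the 1.23% WaveFake figure is notable.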
Why It Matters
Provides a more robust, less biased foundation for detecting AI-generated audio, a capability critical to security and trust in digital communications.