Evolutionary Multi-Objective Fusion of Deepfake Speech Detectors
A new fusion method cuts system complexity in half while matching state-of-the-art detection accuracy.
Researchers from Brno University of Technology have published a method for building more efficient and robust deepfake speech detection systems. Their paper, 'Evolutionary Multi-Objective Fusion of Deepfake Speech Detectors,' addresses a practical problem: ensembles of large self-supervised learning (SSL) models such as Wav2Vec2 or HuBERT achieve high accuracy, but they become oversized and computationally expensive. The researchers propose using NSGA-II, a multi-objective evolutionary algorithm, to fuse the outputs of multiple detectors, optimizing for low error rates and minimal system complexity at the same time.
Their framework explores two main approaches. The first uses a binary encoding to select the best subset of detectors for a simple averaging ensemble. The second, more effective approach uses a real-valued encoding to optimize the specific weights assigned to each detector in a weighted sum. When tested on the challenging ASVspoof 5 dataset with 36 different SSL-based detectors, the real-valued method achieved a state-of-the-art Equal Error Rate (EER) of just 2.37% and a minimum Detection Cost Function (minDCF) of 0.0684.
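The two encodings described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the toy scores, and the parameter counts are all hypothetical, and counting the parameters of every detector with a non-zero gene is only one assumed way to measure complexity.

```python
import numpy as np

def fuse_binary(scores, mask):
    """Binary encoding: average the scores of the selected detectors."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("at least one detector must be selected")
    return scores[:, mask].mean(axis=1)

def fuse_weighted(scores, weights):
    """Real-valued encoding: weighted sum of all detector scores."""
    return scores @ np.asarray(weights, dtype=float)

def total_params(param_counts, genome):
    """Assumed complexity measure: parameters of detectors with a non-zero gene."""
    active = np.asarray(genome) != 0
    return int(np.asarray(param_counts)[active].sum())

# Hypothetical scores: rows are utterances, columns are detectors.
scores = np.array([[0.2, 0.8, 0.5],
                   [0.9, 0.1, 0.4]])
param_counts = [95, 317, 94]  # made-up sizes, in millions of parameters

avg = fuse_binary(scores, [1, 0, 1])           # keep detectors 0 and 2
wsum = fuse_weighted(scores, [0.5, 0.0, 0.5])  # same detectors, explicit weights
size = total_params(param_counts, [1, 0, 1])
```

With equal weights on the same subset, the weighted sum reproduces the simple average, which shows why the real-valued encoding can express everything the binary one can and more.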
Because the search optimizes both objectives at once, it returns a 'Pareto front': a set of configurations in which no solution can be improved on one objective without getting worse on the other. This gives developers a range of deployment options. For instance, the method can identify configurations that match the performance of much larger ensembles while requiring only half the parameters. Such systems are faster, cheaper to run, and more practical to integrate into real-time applications like voice authentication or content moderation platforms, where both accuracy and speed matter.
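To make the idea of a Pareto front concrete, the sketch below filters a list of candidate systems down to the non-dominated ones, with both error rate and parameter count minimized. It is a plain illustration of the dominance rule rather than NSGA-II itself, and the candidate values are invented for the example, not results from the paper.

```python
def pareto_front(points):
    """Return the non-dominated (error, params) pairs; both values are minimized."""
    front = []
    for i, p in enumerate(points):
        # q dominates p if it is no worse on both objectives and better on one.
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Hypothetical candidates: (error rate in %, parameters in millions).
candidates = [(3.0, 900), (3.2, 450), (4.0, 450), (5.0, 100), (3.0, 1200)]
front = pareto_front(candidates)
```

Here (4.0, 450) drops out because (3.2, 450) is more accurate at the same size, and (3.0, 1200) drops out because (3.0, 900) is smaller at the same accuracy. The three remaining systems are the trade-offs a developer would choose among.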
- Uses the NSGA-II evolutionary algorithm to fuse 36 SSL-based detectors, optimizing for both accuracy and complexity.
- Achieves 2.37% EER and 0.0684 minDCF on the ASVspoof 5 dataset, and identifies configurations that match much larger ensembles with only half the parameters.
- Provides a Pareto front of solutions, allowing developers to choose the optimal balance for their specific deployment needs.
Why It Matters
Enables highly accurate deepfake detection in real-world, resource-constrained environments, crucial for securing voice-based authentication and communications.