F5-TTS-DPS generates speech that fools top speaker verification systems
New TTS model achieves best-ever spoofing scores against three SASV systems...
F5-TTS-DPS, submitted to the WildSpoof 2026 TTS Track, is a text-to-speech model designed to be robust to in-the-wild data. Built on the F5-TTS architecture, it integrates an exponential moving average (EMA) into supervised fine-tuning to stabilize training and improve generalization across diverse acoustic conditions. To improve synthesis fidelity, the system uses large language models (LLMs) and large audio language models (LALMs) for dual-scoring prompt selection: reference audio and text prompts are scored and filtered to avoid the alignment problems common in noisy datasets.
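The EMA idea can be illustrated with a minimal sketch. It assumes the common formulation in which a smoothed "shadow" copy of the weights is kept alongside the trained weights; the decay value, the parameter name `w`, and the toy update loop are all hypothetical and are not taken from the F5-TTS-DPS submission.

```python
from typing import Dict


def ema_update(shadow: Dict[str, float],
               params: Dict[str, float],
               decay: float = 0.999) -> None:
    """Update the shadow weights in place:
    shadow = decay * shadow + (1 - decay) * params."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value


# Toy fine-tuning loop: "w" is a hypothetical scalar weight.
params = {"w": 0.0}
shadow = dict(params)  # the shadow copy starts from the current weights

for step in range(1, 6):
    params["w"] = float(step)  # stand-in for one optimizer step
    ema_update(shadow, params, decay=0.9)

print(params["w"], shadow["w"])
```

The shadow weights lag behind the raw weights and average out step-to-step noise, which is why an EMA copy is often the one used for evaluation or inference.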
Experimental results point to strong spoofing capability. With a UTMOS of 3.20 and a speaker similarity of 0.51 on the development set, F5-TTS-DPS recorded the best architecture-agnostic detection cost function (a-DCF) scores (0.1582, 0.5233, and 0.2562) against three state-of-the-art spoofing-robust automatic speaker verification (SASV) systems, outperforming all other submissions. Combined with a competitive word error rate (WER), this suggests that the synthesized speech is hard for these systems to tell apart from genuine speech, posing a significant challenge for current deepfake detection and speaker verification technologies.
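For readers unfamiliar with a-DCF (architecture-agnostic detection cost function), the sketch below shows the general shape of the metric at a single decision threshold: a weighted sum of the miss rate on target trials and the false-acceptance rates on non-target and spoofed trials. The costs, priors, scores, and threshold are placeholder assumptions, not the WildSpoof 2026 configuration, and the normalization and threshold search used in official scoring are omitted.

```python
from typing import Sequence


def a_dcf(target: Sequence[float],
          nontarget: Sequence[float],
          spoof: Sequence[float],
          threshold: float,
          c_miss: float = 1.0,
          c_fa_non: float = 1.0,
          c_fa_spf: float = 1.0,
          pi_tar: float = 0.9,
          pi_non: float = 0.05,
          pi_spf: float = 0.05) -> float:
    """Unnormalized a-DCF at a fixed threshold (accept if score >= threshold).
    All default costs and priors are illustrative placeholders."""
    p_miss = sum(s < threshold for s in target) / len(target)
    p_fa_non = sum(s >= threshold for s in nontarget) / len(nontarget)
    p_fa_spf = sum(s >= threshold for s in spoof) / len(spoof)
    return (c_miss * pi_tar * p_miss
            + c_fa_non * pi_non * p_fa_non
            + c_fa_spf * pi_spf * p_fa_spf)


# Hypothetical verifier scores for three trial types.
target_scores = [0.9, 0.8, 0.7, 0.4]
nontarget_scores = [0.1, 0.2, 0.3, 0.6]
strong_spoof_scores = [0.2, 0.5, 0.75, 0.85]  # most spoofs are accepted
weak_spoof_scores = [0.1, 0.2, 0.3, 0.4]      # all spoofs are rejected

cost_strong = a_dcf(target_scores, nontarget_scores, strong_spoof_scores, 0.5)
cost_weak = a_dcf(target_scores, nontarget_scores, weak_spoof_scores, 0.5)
print(cost_strong, cost_weak)
```

In this toy setup, the only difference between the two calls is how many spoofed trials the verifier accepts, so the spoof term alone accounts for the gap between the two costs.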
- F5-TTS-DPS integrates EMA training and dual-scoring prompt selection with LLMs and LALMs for quality filtering
- Achieved best a-DCF scores (0.1582, 0.5233, 0.2562) across three advanced SASV systems in WildSpoof 2026
- UTMOS of 3.20 and speaker similarity of 0.51 on the development set, alongside a competitive WER
Why It Matters
This model pushes TTS realism to a level that evades top detection systems, raising the stakes for deepfake countermeasures.