F5-TTS-DPS generates speech that fools top speaker verification systems

arXiv eess.AS May 25, 2026

⚡New TTS model achieves best-ever spoofing scores against three SASV systems...

Deep Dive

F5-TTS-DPS, submitted to the WildSpoof 2026 TTS Track, is a text-to-speech model designed to be robust to in-the-wild data. Built atop the F5-TTS architecture, it integrates Exponential Moving Average (EMA) into supervised fine-tuning to stabilize training and improve generalization across diverse acoustic conditions. To enhance synthesis fidelity, the model uses large language models (LLMs) and large audio language models (LALMs) for dual-scoring prompt selection, filtering reference audio and text prompts to avoid the alignment issues common in noisy datasets.
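The EMA idea mentioned above can be illustrated with a minimal sketch. It assumes EMA is applied to model weights, the usual meaning in training pipelines; the article does not give the decay value or the exact integration, so the `EmaWeights` class, the `decay=0.9` setting, and the toy float "parameters" are all hypothetical stand-ins rather than the authors' implementation.

```python
from typing import Dict


class EmaWeights:
    """Keeps a smoothed "shadow" copy of named parameters (hypothetical helper)."""

    def __init__(self, params: Dict[str, float], decay: float = 0.999) -> None:
        if not 0.0 <= decay < 1.0:
            raise ValueError("decay must be in [0, 1)")
        self.decay = decay
        # Start the shadow copy from the current parameter values.
        self.shadow = dict(params)

    def update(self, params: Dict[str, float]) -> None:
        # shadow <- decay * shadow + (1 - decay) * current
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value


# Toy fine-tuning loop: each iteration stands in for one optimizer step.
params = {"w": 0.0}
ema = EmaWeights(params, decay=0.9)  # assumed decay, chosen for illustration
for step in range(1, 4):
    params["w"] = float(step)  # pretend the optimizer moved the weight
    ema.update(params)
```

The shadow weights change more slowly than the raw weights, which is why EMA tends to damp step-to-step noise during fine-tuning on noisy data.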

Experimental results point to strong spoofing capability. On the development set, F5-TTS-DPS reached a UTMOS of 3.20 and a speaker similarity of 0.51, and it recorded the best a-DCF scores (0.1582, 0.5233, and 0.2562) against three state-of-the-art SASV (Spoofing-Aware Speaker Verification) systems, outperforming all other submissions. Combined with a competitive Word Error Rate (WER), this suggests the synthesized speech is hard for these systems to distinguish from real human speech, posing a significant challenge for current deepfake detection and speaker verification technologies.
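For readers unfamiliar with a-DCF (architecture-agnostic detection cost function), the sketch below shows the general shape of the metric as a weighted sum of three error rates: missed target speakers, falsely accepted non-target speakers, and falsely accepted spoofed speech. The cost and prior values are illustrative placeholders, not the WildSpoof 2026 settings. The function names are hypothetical. The sketch also omits any normalization an official scorer may apply, and the article does not describe how the challenge ranks TTS submissions on this metric.

```python
from typing import Sequence


def a_dcf(
    target: Sequence[float],
    nontarget: Sequence[float],
    spoof: Sequence[float],
    threshold: float,
    c_miss: float = 1.0,      # assumed cost of rejecting a genuine target
    c_fa_non: float = 10.0,   # assumed cost of accepting a non-target speaker
    c_fa_spf: float = 10.0,   # assumed cost of accepting spoofed speech
    pi_tar: float = 0.9,      # assumed prior probabilities (sum to 1)
    pi_non: float = 0.05,
    pi_spf: float = 0.05,
) -> float:
    """Unnormalized detection cost at one threshold (scores >= threshold are accepted)."""
    p_miss = sum(s < threshold for s in target) / len(target)
    p_fa_non = sum(s >= threshold for s in nontarget) / len(nontarget)
    p_fa_spf = sum(s >= threshold for s in spoof) / len(spoof)
    return (
        c_miss * pi_tar * p_miss
        + c_fa_non * pi_non * p_fa_non
        + c_fa_spf * pi_spf * p_fa_spf
    )


def min_a_dcf(
    target: Sequence[float],
    nontarget: Sequence[float],
    spoof: Sequence[float],
) -> float:
    """Smallest cost over candidate thresholds taken from the observed scores."""
    scores = list(target) + list(nontarget) + list(spoof)
    # Add one threshold above every score so "reject everything" is considered.
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    return min(a_dcf(target, nontarget, spoof, t) for t in candidates)


# Toy scores from a hypothetical SASV system (higher = more likely accepted).
toy_target = [0.9, 0.8, 0.4]
toy_nontarget = [0.1, 0.2, 0.6]
toy_spoof = [0.3, 0.7]
toy_cost = min_a_dcf(toy_target, toy_nontarget, toy_spoof)
```

Spoofed utterances that score above the threshold raise the third term, which is how more convincing synthetic speech shows up in a verification system's cost.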

Key Points

F5-TTS-DPS integrates EMA training and dual-scoring prompt selection with LLMs and LALMs for quality filtering
Achieved best a-DCF scores (0.1582, 0.5233, 0.2562) across three advanced SASV systems in WildSpoof 2026
UTMOS of 3.20 and speaker similarity of 0.51 on the development set indicate reasonable naturalness and speaker resemblance

Why It Matters

This model pushes TTS realism to a level that evades top detection systems, raising stakes for deepfake countermeasures.
