Audio & Speech

Discrete Optimal Transport and Voice Conversion

A new AI framework transforms voices so well it can trick state-of-the-art spoof detectors.

Deep Dive

Researchers Anton Selitskiy and Maitreya Kocharekar have introduced kDOT, a voice conversion (VC) framework based on discrete optimal transport (OT). kDOT operates in a pretrained speech embedding space and builds a transport map between the source and target speaker distributions from the barycentric projection of the OT plan, in contrast to earlier methods such as kNN-VC and SinkVC, which rely on averaging strategies. The authors report that this improves distribution alignment: in experiments on the LibriSpeech dataset, kDOT consistently outperformed existing techniques on metrics including Word Error Rate (WER), Mean Opinion Score (MOS), and Fréchet Audio Distance (FAD).
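To make the idea of a barycentric projection concrete, here is a minimal sketch, not the authors' implementation. It assumes a squared Euclidean cost, uniform weights on all embeddings, and Sinkhorn iterations as a convenient stand-in solver for the OT plan; the summary does not say which cost, solver, or embedding model kDOT uses. The function names `sinkhorn_plan` and `barycentric_projection`, the parameter `k`, and the random vectors standing in for speech embeddings are all hypothetical.

```python
import numpy as np


def sinkhorn_plan(cost, reg=0.05, n_iters=500):
    """Approximate OT plan between uniform marginals (assumed solver: Sinkhorn)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on source embeddings
    b = np.full(m, 1.0 / m)  # uniform mass on target embeddings
    # Scale the cost so the regularization strength is unit-free.
    K = np.exp(-cost / (reg * cost.max()))
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def barycentric_projection(plan, target, k=None):
    """Map each source point to the plan-weighted average of target points.

    If k is given, only the k target points receiving the most mass from
    each source point are kept (an assumed reading of "number of
    transported embeddings").
    """
    if k is not None:
        top = np.argpartition(plan, -k, axis=1)[:, -k:]
        mask = np.zeros(plan.shape, dtype=bool)
        np.put_along_axis(mask, top, True, axis=1)
        plan = np.where(mask, plan, 0.0)
    weights = plan / plan.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ target


# Random vectors stand in for source and target speaker embeddings.
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(40, 8))
target = rng.normal(5.0, 1.0, size=(60, 8))

# Pairwise squared Euclidean cost (an assumption; other costs are possible).
cost = ((source[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)
plan = sinkhorn_plan(cost)
converted = barycentric_projection(plan, target, k=4)
```

Each row of `converted` is a convex combination of target embeddings, so the mapped points always lie inside the region spanned by the target speaker's data. In a real pipeline the converted embeddings would then be passed to a vocoder to synthesize audio.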

The core technical strength of kDOT is its domain adaptation capability within the embedding space. The researchers ran a systematic ablation study over the number of transported embeddings and analyzed the impact of utterance duration. Most notably for security, they showed that applying discrete OT as a post-processing step can turn AI-generated spoofed speech into audio samples that a state-of-the-art spoofing detector misclassifies as bona fide (real). This finding exposes a vulnerability in current voice security systems and highlights the dual-use nature of the technology: the same technique that improves voice conversion quality also poses new challenges for audio forensics and biometric security.

Key Points
  • kDOT framework uses discrete optimal transport with barycentric projection for voice conversion, outperforming kNN-VC and SinkVC on LibriSpeech benchmarks.
  • Applied as a post-processing step, discrete OT can transform spoofed speech so that a state-of-the-art spoofing detector classifies it as bona fide, exposing a vulnerability in current anti-spoofing systems.
  • Comprehensive ablation studies analyzed the impact of transported embedding count and source/target utterance duration on performance.

Why It Matters

kDOT improves voice conversion quality, but the same technique can help audio deepfakes slip past spoof detectors, putting new pressure on audio forensics and biometric security.