Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning
A new method disentangles speaker traits from AI voice generators to trace deepfake audio back to its source.
A team of researchers, including Xi Xuan, Wenxin Zhang, and Tomi H. Kinnunen, has introduced a method for tracing AI-generated speech back to its source model. Their Speaker-Disentangled Metric Learning (SDML) framework tackles a core problem in audio forensics: current systems assume the 'fingerprint' of an AI voice generator is independent of the speaker's vocal traits, but this paper shows that assumption does not hold. The new approach actively disentangles the two factors to produce a purer signature of the AI tool itself.
The technical innovation lies in two novel loss functions. The first employs Chebyshev polynomials to mitigate gradient instability, a common hurdle in the optimization required for disentanglement. The second projects the embeddings of both the source AI model and the speaker identity into hyperbolic space, measuring distances with a Riemannian metric. This geometry helps strip away speaker-specific information, leaving more discriminative features that uniquely identify the AI generator.
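To make the two mathematical ingredients concrete, here is a rough, illustrative sketch, not the authors' implementation. The function names (`chebyshev_T`, `poincare_distance`) and the exact way these quantities enter the losses are assumptions; the sketch only shows the Chebyshev recurrence (whose values are bounded on [-1, 1], a property commonly exploited to keep gradients well-behaved) and the geodesic distance in the Poincaré ball model of hyperbolic space.

```python
import numpy as np

def chebyshev_T(x, n):
    # First-kind Chebyshev polynomials via the three-term recurrence
    # T0(x) = 1, T1(x) = x, T_{k+1}(x) = 2x*T_k(x) - T_{k-1}(x).
    # On [-1, 1] every T_k stays within [-1, 1], which is why such
    # polynomials are a natural choice for taming loss gradients.
    t_prev, t_curr = np.ones_like(x), x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev
    return t_curr

def poincare_distance(u, v, eps=1e-9):
    # Geodesic (Riemannian) distance in the Poincare ball:
    # d(u, v) = arccosh(1 + 2||u-v||^2 / ((1-||u||^2)(1-||v||^2))).
    # Both embeddings must lie strictly inside the unit ball.
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq / max(denom, eps))
```

In a metric-learning loss, a distance like `poincare_distance` would pull embeddings of clips from the same generator together and push different generators apart, while the speaker branch is trained so that speaker identity contributes as little as possible to that distance.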
The team validated their SDML framework on the MLAAD benchmark, introducing four new evaluation protocols designed specifically for source-speaker disentanglement scenarios. The results demonstrate the framework's effectiveness, providing a more robust tool for verifying whether two synthetic speech samples originate from the same underlying AI system. This work, submitted to Interspeech 2026, represents a significant step in the arms race against increasingly convincing audio deepfakes.
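The verification task itself has a simple shape: extract a source embedding from each clip, then threshold the distance between them. The sketch below is a minimal illustration under assumed names (`same_source` is hypothetical), and it uses cosine distance rather than the hyperbolic metric the paper actually employs.

```python
import numpy as np

def same_source(emb_a, emb_b, threshold):
    # Source verification: declare two synthetic clips to come from
    # the same generator when their source embeddings are closer than
    # a tuned decision threshold. Cosine distance stands in here for
    # the paper's Riemannian metric.
    cos = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return (1 - cos) < threshold
```

In practice the threshold would be tuned on a development set, and performance reported with verification metrics such as the equal error rate.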
- Proposes a Speaker-Disentangled Metric Learning (SDML) framework to verify if two AI speech clips are from the same generator.
- Uses Chebyshev polynomials to stabilize training and Riemannian metrics in hyperbolic space to separate speaker identity from AI source signature.
- Evaluated on the MLAAD benchmark with four new protocols, showing improved performance for forensic source verification.
Why It Matters
Provides a more sophisticated forensic tool to trace AI-generated audio deepfakes back to their source, crucial for combating misinformation.