ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks
New framework slashes detection error from 39.6% to 7.4% by modeling the natural rhythm and intonation of human speech.
A research team led by Aurosweta Mahapatra has introduced ProSDD, a novel AI framework designed to combat a critical weakness in current speech deepfake detection (SDD) systems. While existing models perform well on standard benchmarks, they often fail against more sophisticated, emotionally expressive synthetic voices, as they tend to learn dataset-specific artifacts rather than the fundamental cues of genuine human speech. ProSDD addresses this by mimicking human intuition: it first learns the natural prosodic variability—the rhythm, stress, and intonation—inherent to real speech through a supervised masked prediction task. This foundational knowledge of how pitch, energy, and voice activity naturally fluctuate is then used to identify deviations characteristic of AI-generated fakes.
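To make the masked prediction idea concrete, here is a minimal schematic sketch, not the paper's actual model: frame-level prosodic features (here, synthetic pitch and energy contours) are partially masked, a predictor fills in the hidden frames, and the training signal is the reconstruction error on those masked frames. The feature values, the 15% mask rate, and the interpolation stand-in for the neural predictor are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-frame "prosodic features": pitch (Hz) and energy.
# In ProSDD these would be extracted from real speech; here they are synthetic.
T = 100
pitch = 120 + 20 * np.sin(np.linspace(0, 4 * np.pi, T)) + rng.normal(0, 2, T)
energy = 0.5 + 0.3 * np.abs(np.sin(np.linspace(0, 2 * np.pi, T)))
feats = np.stack([pitch, energy], axis=1)            # shape (T, 2)

# Masked prediction setup: hide a random ~15% of frames (illustrative rate).
mask = rng.random(T) < 0.15
masked_feats = feats.copy()
masked_feats[mask] = 0.0                              # zero out masked frames

# Stand-in "predictor": linear interpolation from the unmasked neighbours.
# (A trained neural network would play this role in the real system.)
idx = np.arange(T)
pred = feats.copy()
for d in range(feats.shape[1]):
    pred[mask, d] = np.interp(idx[mask], idx[~mask], feats[~mask, d])

# Pretext objective: mean squared reconstruction error on masked frames only.
loss = float(np.mean((pred[mask] - feats[mask]) ** 2))
print(round(loss, 4))
```

Because genuine speech has smooth, correlated prosodic trajectories, a model that learns to fill in masked frames well has implicitly learned what natural prosodic variability looks like; synthetic speech that violates those regularities then stands out.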
In a second stage, the model jointly optimizes this prosodic learning objective with traditional spoof classification. The results are striking: trained on the ASVspoof 2024 dataset, ProSDD cut the Equal Error Rate (EER) from 39.62% to just 7.38%, an 81% relative reduction. It also generalized well, cutting the EER from 25.43% to 16.14% when trained on the older ASVspoof 2019 data. It further achieved roughly 50% relative error reductions on specialized emotional spoofing datasets such as EmoFake and EmoSpoof-TTS, demonstrating robustness against expressive attacks that fool other systems.
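The second-stage joint optimization can be sketched as a weighted sum of the two losses. This is a hedged illustration only: the loss forms (binary cross-entropy for spoof classification, MSE for prosody reconstruction) and the trade-off weight `lambda_prosody` are assumptions for the sketch, not values taken from the paper.

```python
import numpy as np

def binary_cross_entropy(p, y):
    """Mean BCE between predicted probabilities p and binary labels y."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def joint_loss(spoof_probs, labels, prosody_pred, prosody_true,
               lambda_prosody=0.5):
    """Schematic stage-two objective: classification + weighted prosody term."""
    cls = binary_cross_entropy(spoof_probs, labels)
    rec = float(np.mean((prosody_pred - prosody_true) ** 2))
    return cls + lambda_prosody * rec

# Toy batch: model scores (1 = bonafide), ground-truth labels, and
# predicted vs. true prosodic features for a couple of frames.
probs = np.array([0.9, 0.2, 0.8, 0.1])
labels = np.array([1.0, 0.0, 1.0, 0.0])
pros_pred = np.array([[118.0, 0.5], [122.0, 0.6]])
pros_true = np.array([[120.0, 0.5], [121.0, 0.55]])

print(round(joint_loss(probs, labels, pros_pred, pros_true), 4))
```

Keeping the prosody term active during spoof training is what anchors the classifier to natural prosodic structure rather than letting it drift back to dataset-specific artifacts.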
The paper, submitted to Interspeech 2026, represents a significant shift in strategy for audio forensics. Instead of chasing the ever-evolving artifacts left by specific AI voice generators, ProSDD anchors itself in the immutable characteristics of natural human prosody. This approach promises more future-proof and generalizable detection, moving the field closer to creating reliable safeguards against the misuse of hyper-realistic synthetic voices in fraud and disinformation.
- Cuts detection error on ASVspoof 2024 benchmark from 39.62% to 7.38% (81% relative reduction).
- Uses a two-stage process: first learn natural speech prosody (pitch, energy, rhythm), then spot fakes by their deviations from it.
- Achieves ~50% relative error reduction on emotional spoof datasets EmoFake and EmoSpoof-TTS.
Why It Matters
Provides a more robust defense against emotionally manipulative voice clones used in scams and misinformation campaigns.