New self-supervised method learns prosody while hiding speaker identity
A paper accepted to ACL 2026 reports strong speaker disentanglement without sacrificing performance on prosody tasks.
A new self-supervised learning framework from researchers at the University of Washington tackles a key privacy issue in speech AI: the leakage of speaker identity from acoustic-prosodic features like pitch. Kevin Everson and Mari Ostendorf, in their paper accepted to ACL 2026, propose an encoder that learns prosody representations while explicitly disentangling speaker characteristics. The approach uses a novel training objective that encourages the representation to discard identity information. Disentanglement is measured by how poorly speaker labels can be predicted from the learned embeddings.
The encoder was evaluated on three tasks: pitch reconstruction, detection of prosodic events (e.g., phrase boundaries, prominence), and speaker identification. It outperformed both raw prosody features and HuBERT-base embeddings on the prosody tasks, while achieving near-chance accuracy on speaker identification, a strong indicator of successful disentanglement. This means developers can use the representations for downstream applications (e.g., emotional speech synthesis, prosody-aware ASR) without exposing sensitive speaker attributes, addressing growing regulatory and ethical concerns around biometric data privacy.
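To make "near-chance accuracy" concrete, the sketch below runs a simple speaker probe on synthetic embeddings and compares its accuracy with the chance level of 1 / (number of speakers). It is illustrative only: the nearest-centroid probe, the synthetic data, and every name in it (`make_embeddings`, `probe_accuracy`, `speaker_shift`) are assumptions made for this example, not the probe, data, or code used in the paper.

```python
import random


def make_embeddings(num_speakers, per_speaker, dim, speaker_shift, rng):
    """Build synthetic (embedding, speaker_id) pairs.

    speaker_shift > 0 adds a speaker-specific offset (identity leaks);
    speaker_shift == 0 makes embeddings independent of the speaker.
    """
    data = []
    for spk in range(num_speakers):
        for _ in range(per_speaker):
            vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            vec[spk % dim] += speaker_shift
            data.append((vec, spk))
    return data


def fit_centroids(train):
    """Average the training embeddings of each speaker."""
    sums, counts = {}, {}
    for vec, spk in train:
        if spk not in sums:
            sums[spk] = [0.0] * len(vec)
            counts[spk] = 0
        sums[spk] = [a + b for a, b in zip(sums[spk], vec)]
        counts[spk] += 1
    return {spk: [v / counts[spk] for v in total] for spk, total in sums.items()}


def probe_accuracy(train, test):
    """Nearest-centroid speaker probe: fraction of test items labeled correctly."""
    centroids = fit_centroids(train)
    correct = 0
    for vec, spk in test:
        pred = min(
            centroids,
            key=lambda s: sum((a - b) ** 2 for a, b in zip(vec, centroids[s])),
        )
        correct += pred == spk
    return correct / len(test)


NUM_SPEAKERS, DIM, PER_SPEAKER = 4, 8, 50
rng = random.Random(0)  # fixed seed so the sketch is repeatable
chance = 1.0 / NUM_SPEAKERS  # valid because every speaker has the same number of items


def run_probe(speaker_shift):
    train = make_embeddings(NUM_SPEAKERS, PER_SPEAKER, DIM, speaker_shift, rng)
    test = make_embeddings(NUM_SPEAKERS, PER_SPEAKER, DIM, speaker_shift, rng)
    return probe_accuracy(train, test)


leaky_acc = run_probe(speaker_shift=5.0)  # identity present in the embedding
clean_acc = run_probe(speaker_shift=0.0)  # identity absent from the embedding

print(f"chance level:            {chance:.2f}")
print(f"probe on leaky vectors:  {leaky_acc:.2f}")
print(f"probe on clean vectors:  {clean_acc:.2f}")
```

A probe that stays at chance shows only that this particular classifier cannot recover the speaker. A stronger probe could still find residual identity cues, so near-chance results are best read as evidence of disentanglement rather than proof.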
- Self-supervised approach that learns prosody representations while stripping speaker identity.
- Outperforms HuBERT-base and raw prosody baselines on pitch reconstruction and prosodic event detection.
- Achieves near-chance speaker identification accuracy, indicating that little speaker information remains in the learned embeddings.
Why It Matters
Enables privacy-compliant use of speech prosody in healthcare, voice assistants, and synthetic media without exposing speaker identity.