Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models
A new study reveals hidden states in popular voice AI models can be used to identify users.
A team of researchers from Nanyang Technological University and other institutions has published a critical study revealing a major privacy flaw in modern, always-on voice AI assistants. Their paper, "Privacy-Preserving End-to-End Full-Duplex Speech Dialogue Models," demonstrates that the internal hidden states of popular models like SALM-Duplex and Moshi leak substantial speaker identity information. Using the VoicePrivacy 2024 protocol, they found this leakage persists across all transformer layers, with SALM-Duplex showing stronger leakage in early layers and Moshi leaking uniformly. Alarmingly, the ability to link a voice to an identity (linkability) rises sharply within just the first few conversational turns, posing a significant risk for devices that are always listening.
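The leakage finding boils down to how easily two utterances can be linked to the same speaker from a model's internal activations. As a minimal sketch of that idea, assuming mean-pooled hidden states and cosine scoring (the paper's actual attacker follows the VoicePrivacy 2024 protocol; the function names here are illustrative):

```python
import numpy as np

def pooled_embedding(hidden_states: np.ndarray) -> np.ndarray:
    """Mean-pool frame-level hidden states of shape (T, D) into a
    single unit-norm vector per utterance."""
    vec = hidden_states.mean(axis=0)
    return vec / np.linalg.norm(vec)

def linkability_score(states_a: np.ndarray, states_b: np.ndarray) -> float:
    """Cosine similarity between pooled embeddings of two utterances.
    Higher scores mean the two clips are more easily attributed to the
    same speaker -- the 'linkability' risk described in the study."""
    return float(pooled_embedding(states_a) @ pooled_embedding(states_b))
```

Run per transformer layer, a probe like this would show where in the network speaker identity concentrates, mirroring the early-layer versus uniform leakage contrast reported for SALM-Duplex and Moshi.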
To address this, the researchers proposed two novel streaming anonymization setups built on a tool called Stream-Voice-Anon. The first, Anon-W2F, operates in the feature domain and raises the Equal Error Rate (EER)—a measure of how hard it is to identify a speaker—from 11.2% to 41.0%, a more than 3.5x relative increase that approaches the 50% random-chance ceiling. The second, Anon-W2W, works at the waveform level and retains 78-93% of the original model's semantic understanding (measured by sBERT similarity) while adding less than 0.8 seconds of latency. This work gives developers a practical framework for building voice AI that is both responsive and privacy-preserving, moving beyond simple audio encryption to protect the data inside the model itself.
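To make the headline numbers concrete: EER is the operating point where the false-accept and false-reject rates of a speaker-verification attacker are equal, so 50% means the attacker is guessing at random. A minimal sketch of computing it from attacker scores (a standard threshold sweep, not the paper's evaluation code):

```python
import numpy as np

def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """EER from same-speaker ('genuine') and different-speaker
    ('impostor') similarity scores. Sweep every observed score as a
    threshold and find where false-accept rate (impostors at or above
    the threshold) equals false-reject rate (genuine below it)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)
```

On this scale, the reported jump from 11.2% (attacker succeeds easily) to 41.0% (attacker is close to coin-flipping) is what makes Anon-W2F's feature-domain anonymization meaningful.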
- SALM-Duplex and Moshi models leak speaker ID data across all transformer layers, with linkability spiking in the first few conversation turns.
- The proposed Anon-W2F anonymization method raises the identification error rate (EER) by over 3.5x to 41.0%, nearing random chance.
- The Anon-W2W method maintains 78-93% of semantic performance with sub-second latency (under 0.8s FRL), enabling real-time private conversation.
Why It Matters
This exposes a fundamental privacy risk in always-listening AI and provides a practical blueprint for building secure, real-time voice assistants.