New voice privacy attack uses BERT to identify speakers from text alone
BERT-based ASV reaches a 35% mean EER, and as low as 2% for some speakers, using only linguistic content
A team led by Ünal Ege Gaznepoglu from FAU Erlangen-Nuremberg, with partners at Fraunhofer IIS, Saarland University, and others, has demonstrated a voice privacy attack that relies on linguistic content rather than acoustic features. Their method, presented at INTERSPEECH 2025, adapts the language model BERT to act as an automatic speaker verification (ASV) system. On standard VoicePrivacy datasets, the attack achieved a mean equal error rate (EER) of 35%, where 50% would correspond to chance. Certain speakers were identified with EERs as low as 2% based purely on what they said, not how they said it.
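To make the idea of text-only speaker verification concrete, here is a minimal sketch that scores two transcripts by content-word overlap. It is not the researchers' BERT-based system: a bag-of-words cosine similarity stands in for learned text embeddings, and the transcripts, the stopword list, and the 0.5 threshold are all invented for illustration.

```python
import math
import re
from collections import Counter

# Tiny, hypothetical stopword list so that content words drive the score.
STOPWORDS = {"the", "a", "to", "of", "and", "in", "as", "at"}

def content_words(text):
    """Count lowercase word tokens, ignoring the stopwords above."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def same_speaker(enroll_text, trial_text, threshold=0.5):
    """Accept the trial if its wording is close enough to the enrollment text."""
    score = cosine_similarity(content_words(enroll_text), content_words(trial_text))
    return score >= threshold

# Invented transcripts: two from the same (hypothetical) audiobook, one unrelated.
enroll = "the captain ordered the crew to lower the sails before the storm"
trial_same = "the crew watched the captain as the storm tore at the sails"
trial_other = "she poured tea in the garden and spoke of her sister's wedding"

print(same_speaker(enroll, trial_same))
print(same_speaker(enroll, trial_other))
```

No audio is involved at any point: the decision depends only on the words, which is why anonymizing the voice alone would not affect a score of this kind.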
The attack's success stems from intra-speaker linguistic similarity: individuals tend to use consistent vocabulary, phrasing, and semantic patterns. The researchers' explainability study linked the model's decisions to semantically similar keywords that recur across a speaker's utterances, a bias introduced by the way the LibriSpeech dataset is curated. This finding exposes a critical flaw in current speaker anonymization evaluations, which assume that removing voice biometrics (pitch, timbre, etc.) guarantees privacy. The authors call for reworking the VoicePrivacy datasets to ensure fair and unbiased evaluation, and they caution against relying solely on global EER metrics for privacy assessments.
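The caution about global EER can be illustrated with a small worked example. The sketch below computes EER with a simple threshold sweep and compares a pooled figure with per-speaker figures. The scores and speaker labels are invented toy data, not results from the study, and the function is a plain approximation, not the evaluation code used by the authors or by VoicePrivacy.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: sweep thresholds and take the point where the
    false-accept rate (FAR) and false-reject rate (FRR) are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < t for s in target_scores) / len(target_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Hypothetical similarity scores (higher = "more likely the same speaker").
# Speaker A is easy to recognize from wording; speaker B is not.
scores = {
    "speaker_a": {"target": [0.9, 0.8, 0.85, 0.95], "nontarget": [0.1, 0.2, 0.15, 0.3]},
    "speaker_b": {"target": [0.6, 0.4, 0.5, 0.3], "nontarget": [0.5, 0.35, 0.6, 0.45]},
}

# Per-speaker EERs.
per_speaker = {
    name: equal_error_rate(s["target"], s["nontarget"]) for name, s in scores.items()
}

# Global EER over all trials pooled together.
pooled_target = [x for s in scores.values() for x in s["target"]]
pooled_nontarget = [x for s in scores.values() for x in s["nontarget"]]
global_eer = equal_error_rate(pooled_target, pooled_nontarget)

print(per_speaker)
print(global_eer)
```

In this toy data the pooled EER is 25%, while one speaker sits at 0% (fully identifiable) and the other at 50% (chance). A single global number hides exactly the kind of per-speaker exposure the study reports.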
- BERT-based ASV achieved 35% mean EER on VoicePrivacy datasets, with some speakers at 2% EER using only text
- Attack exploits intra-speaker linguistic similarity (consistent vocabulary and phrasing patterns), not acoustic features
- Study reveals dataset bias in LibriSpeech and recommends reworking evaluation benchmarks for voice privacy
Why It Matters
Voice anonymization systems may fail if attackers can profile speakers by what they say, not just how they sound.