Audio & Speech

DiffAnon: Diffusion-based Prosody Control for Voice Anonymization

Claimed to be the first voice anonymizer with adjustable prosody control at inference time.

Deep Dive

A team of researchers from Johns Hopkins University and MIT-IBM Watson AI Lab has introduced DiffAnon, a diffusion-based framework for voice anonymization that gives users explicit control over how much prosody (the rhythm, stress, and intonation that carry emotional nuance) is preserved versus stripped away for privacy. Traditional voice anonymizers force a fixed trade-off: either they discard prosody entirely to maximize privacy, or they keep it and risk leaking speaker identity. DiffAnon addresses this by applying classifier-free guidance (CFG) to a diffusion model that refines acoustic detail on top of semantic embeddings from a residual vector quantization (RVQ) codec. The result is continuous, interpolatable control at inference time: a single trained model can produce outputs ranging from fully anonymized (robotic, neutral) to highly expressive (preserving emotion, tone, and pitch) by adjusting one slider-like setting.
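To make the slider idea concrete, here is a minimal sketch of the standard classifier-free guidance blend. It is not DiffAnon's implementation: this summary does not give the paper's exact formulation, so the sketch assumes the conditional branch receives prosody features and the unconditional branch does not. The function name `guided_prediction`, the parameter `guidance_scale`, and the toy vectors are all hypothetical.

```python
def guided_prediction(pred_uncond, pred_cond, guidance_scale):
    """Blend two denoiser outputs with the standard CFG rule.

    guidance_scale = 0.0 returns the unconditional prediction,
    guidance_scale = 1.0 returns the conditional prediction,
    and values in between interpolate linearly.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


# Toy stand-ins for one sampling step of a denoiser (assumed, not from the paper).
pred_uncond = [0.0, 1.0, -1.0]  # prosody conditioning dropped
pred_cond = [1.0, -1.0, 0.5]    # prosody conditioning supplied

# Sweep the scale to see the output move from one branch to the other.
for scale in (0.0, 0.5, 1.0):
    print(scale, guided_prediction(pred_uncond, pred_cond, scale))
```

Under these assumptions, a low scale corresponds to the privacy-leaning end of the range and a higher scale to the expressive end. Because the scale is applied only at sampling time, no retraining is needed to move between operating points.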

In experiments, DiffAnon preserved prosodic features such as stress and intonation (its utility measure) while keeping privacy competitive with state-of-the-art baselines across multiple controllable operating points. The work was submitted to Interspeech 2026 and builds on prior research in diffusion models for speech synthesis and voice conversion. Its key innovation is structured, interpolatable inference-time control, which the authors claim is a first for voice anonymization frameworks. This could matter for privacy-preserving voice assistants, call center analytics, and any application where emotional expression is important but speaker identity must be protected.

Key Points
  • First voice anonymization framework with continuous, interpolatable inference-time prosody control via classifier-free guidance
  • Uses a diffusion model to refine acoustic detail over RVQ codec semantic embeddings
  • Achieves a strong utility-privacy trade-off across multiple operating points in experiments

Why It Matters

Enables privacy-preserving voice apps that retain emotional nuance, which is critical for healthcare, customer service, and virtual assistants.