Controllable Accent Normalization via Discrete Diffusion
A new AI system lets you dial an accent's strength up or down, from native-sounding to fully preserved.
A research team including Qibing Bai, Haizhou Li, and others has introduced DLM-AN, a system for controllable accent normalization. Unlike previous methods, which produce either an 'accented' or a 'neutral' output, DLM-AN exposes accent strength as a tunable setting. This matters for applications like language learning, where a learner might want to reduce an accent gradually, or dubbing, where some regional flavor might need to be retained. The core innovation is its use of masked discrete diffusion over self-supervised speech tokens, combined with a 'Common Token Predictor' that identifies which tokens of the source audio already align with native pronunciation.
These identified tokens are then selectively reused to initialize the reverse diffusion process, which gives a simple control mechanism: reusing more source tokens preserves more of the original accent character, while reusing fewer lets the model regenerate more of the utterance. The system also incorporates a flow-matching Duration Ratio Predictor that automatically adjusts the total speech duration to better match the natural rhythm of the target accent, improving prosody. In experiments on multi-accent English datasets, DLM-AN reportedly achieved the lowest word error rate among all compared systems while delivering competitive accent reduction and smooth, interpretable control over accent strength. The paper has been submitted for review to Interspeech 2026.
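The token-reuse idea can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not the paper's implementation: `MASK_ID`, the function `init_masked_sequence`, the per-token `common_scores` (standing in for whatever the Common Token Predictor outputs), and the top-scoring selection rule are all hypothetical. The sketch only shows how a single `reuse_fraction` knob could decide which source tokens are kept and which are masked before reverse diffusion begins.

```python
from typing import List, Sequence

MASK_ID = -1  # hypothetical id standing in for the [MASK] token


def init_masked_sequence(
    source_tokens: Sequence[int],
    common_scores: Sequence[float],
    reuse_fraction: float,
) -> List[int]:
    """Keep the highest-scoring source tokens and mask the rest.

    common_scores[i] is an assumed confidence that source_tokens[i]
    already matches native pronunciation. reuse_fraction in [0, 1]
    is the control knob: 1.0 keeps every source token, 0.0 masks all.
    """
    if len(source_tokens) != len(common_scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 <= reuse_fraction <= 1.0:
        raise ValueError("reuse_fraction must be in [0, 1]")

    n_keep = round(reuse_fraction * len(source_tokens))
    # Rank positions by score, highest first; break ties by position.
    ranked = sorted(
        range(len(source_tokens)),
        key=lambda i: (-common_scores[i], i),
    )
    keep = set(ranked[:n_keep])
    # Masked positions are the ones a diffusion model would fill in.
    return [
        tok if i in keep else MASK_ID
        for i, tok in enumerate(source_tokens)
    ]


if __name__ == "__main__":
    tokens = [11, 22, 33, 44]
    scores = [0.9, 0.1, 0.8, 0.4]
    print(init_masked_sequence(tokens, scores, 0.5))
```

In this sketch, raising `reuse_fraction` leaves fewer masked positions for the model to regenerate, which corresponds to preserving more of the original accent.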
- Uses masked discrete diffusion over self-supervised speech tokens to generate the normalized speech.
- Features a 'Common Token Predictor' whose reused source tokens act as a simple knob for how much of the accent is retained.
- Includes a flow-matching Duration Ratio Predictor to automatically adjust speech timing and rhythm for naturalness.
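As a rough illustration of the duration adjustment, the following Python sketch assumes the Duration Ratio Predictor yields a single scalar ratio for the whole utterance and that the output length is simply the source length scaled by that ratio. The function name `target_length` and the rounding rule are hypothetical; the flow-matching model that would produce the ratio is not shown.

```python
def target_length(source_len: int, duration_ratio: float) -> int:
    """Scale a source token count by an assumed utterance-level ratio.

    duration_ratio < 1 shortens the utterance, > 1 lengthens it.
    A non-empty input always yields at least one output token.
    """
    if source_len < 0:
        raise ValueError("source_len must be non-negative")
    if duration_ratio <= 0:
        raise ValueError("duration_ratio must be positive")
    if source_len == 0:
        return 0
    return max(1, round(source_len * duration_ratio))


if __name__ == "__main__":
    # Example: a 200-token utterance with a hypothetical ratio of 0.9.
    print(target_length(200, 0.9))
```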
Why It Matters
Tunable accent strength enables more nuanced applications in language learning tools, accessible media dubbing, and voice interfaces that respect cultural identity.