Audio & Speech

FlowW2N: Whispered-to-Normal Speech Conversion via Flow-Matching

New AI model transforms whispered speech into normal voice using only synthetic data and 10-step inference.

Deep Dive

A research team including Fabian Ritter-Gutierrez, Pablo Peso Parada, and colleagues has developed FlowW2N, a system for whispered-to-normal speech conversion based on conditional flow matching. The task is to reconstruct the phonation missing from whispered input while preserving content and speaker identity. Two obstacles have traditionally made this hard: whispered and voiced recordings of the same utterance are temporally misaligned, and paired training data is scarce. FlowW2N addresses both by training exclusively on synthetic, time-aligned whisper-normal pairs and conditioning on domain-invariant features extracted from ASR embeddings.
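To make the flow-matching idea concrete, here is a minimal sketch of how a conditional flow matching model is typically trained and sampled. It assumes the common linear interpolation path between a noise sample and a target, which the summary does not confirm is the path FlowW2N uses. The names `cfm_training_pair`, `euler_sample`, and `toy_velocity` are hypothetical, and `toy_velocity` is a hand-written stand-in for a trained network, not the authors' model.

```python
import random

def cfm_training_pair(x0, x1, t):
    """Linear-path flow matching (an assumption): return the interpolated
    point x_t and the velocity the network would be trained to regress."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity

def euler_sample(velocity_fn, x0, cond, num_steps=10):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(x, t, cond):
    """Hypothetical stand-in for a trained network: it points straight at
    `cond`. In a real system `cond` would be conditioning features (for
    example ASR embeddings) and the target would be learned, not given."""
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]   # starting sample at t=0
target = [0.5, -1.0, 2.0, 0.0]                    # toy "normal speech" features
out = euler_sample(toy_velocity, noise, target, num_steps=10)
```

With this toy velocity field, ten Euler steps carry the noise sample onto the target. A trained model would replace `toy_velocity` with a neural network evaluated once per step, which is why the number of steps directly sets the inference cost.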

The key idea is that high-level ASR embeddings are strongly invariant between synthetic and real whispered speech, which lets the model generalize to real whispers despite never observing them during training. To choose which ASR layer to condition on, the team developed a selection criterion that optimizes both content informativeness and cross-domain invariance. FlowW2N achieves state-of-the-art intelligibility, reducing Word Error Rate by 26-46% relative to prior work on the CHAINS and wTIMIT datasets while using only 10 steps at inference. The small step count keeps inference cost low, and training on synthetic pairs eliminates the need to collect real paired data, opening possibilities for medical applications, privacy-preserving communication, and assistive technologies.
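The summary does not give the exact form of the layer selection criterion, so the following is only an illustrative sketch of the general idea. It assumes each candidate ASR layer comes with a content informativeness score (for example, a probe accuracy) and with utterance-level embeddings of matched synthetic and real whispers, and it scores invariance as their mean cosine similarity. The function names, the weighting, and the toy numbers are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def invariance(synth_embs, real_embs):
    """Mean cosine similarity over matched synthetic/real embedding pairs."""
    sims = [cosine(s, r) for s, r in zip(synth_embs, real_embs)]
    return sum(sims) / len(sims)

def select_layer(layers, weight=0.5):
    """Pick the layer with the best weighted sum of informativeness and
    invariance. `layers` maps layer index -> (info, synth_embs, real_embs).
    The weighted sum is an assumed scoring rule, not the paper's."""
    best_layer, best_score = None, float("-inf")
    for idx in sorted(layers):
        info, synth_embs, real_embs = layers[idx]
        score = weight * info + (1.0 - weight) * invariance(synth_embs, real_embs)
        if score > best_score:
            best_layer, best_score = idx, score
    return best_layer, best_score

# Toy data: a low layer that is informative but domain-sensitive,
# and a high layer that is slightly less informative but invariant.
layers = {
    2: (0.9, [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]),
    10: (0.8, [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
}
best_layer, best_score = select_layer(layers)
```

In this toy setup the higher layer wins because its embeddings are identical across domains, which mirrors the claim that high-level ASR features transfer from synthetic to real whispers.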

Key Points
  • Reduces Word Error Rate by 26-46% relative to prior work on the CHAINS and wTIMIT datasets
  • Uses only synthetic training data and requires no real whisper-normal pairs
  • Performs conversion in just 10 inference steps for efficient deployment
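For readers unfamiliar with the metric, the sketch below shows how Word Error Rate and a relative reduction are usually computed. The transcripts are made-up examples chosen for easy arithmetic and are not results from the paper. The function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers `baseline`."""
    return (baseline - improved) / baseline

reference = "the cat sat on the mat"
baseline_wer = word_error_rate(reference, "the cat sat in mat")  # 2 errors
improved_wer = word_error_rate(reference, "the cat sat on mat")  # 1 error
reduction = relative_reduction(baseline_wer, improved_wer)
```

Here the baseline makes two errors in six words and the improved system makes one, a 50% relative reduction. A reported 26-46% relative reduction means the new WER is 54-74% of the prior system's WER, not that WER dropped by 26-46 absolute points.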

Why It Matters

FlowW2N could enable new assistive technologies for people with speech disorders, as well as private communication tools, without requiring difficult-to-collect real whisper data.