SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis
This AI reads your lips from silent video and reconstructs your speech.
Researchers have introduced SLD-L2S, a new lip-to-speech synthesis model that maps visual lip movements directly to audio using a hierarchical subspace latent diffusion framework, bypassing traditional intermediate representations such as mel-spectrograms to avoid the information loss they introduce. The model employs a novel diffusion convolution block and reparameterized flow matching, and it achieves state-of-the-art generation quality on multiple benchmarks, surpassing existing methods in both objective and subjective evaluations.
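To make the flow-matching idea concrete, below is a minimal sketch of a conditional flow-matching training step in which a small network regresses the velocity between noisy and clean audio latents, conditioned on lip-motion features. The VelocityNet module, tensor shapes, and feature dimensions are illustrative assumptions and do not reproduce the SLD-L2S architecture, its diffusion convolution block, or its reparameterized variant.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the flow-matching velocity for audio latents, conditioned on
    lip-motion features and the flow time t (hypothetical stand-in model)."""
    def __init__(self, latent_dim: int = 64, cond_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, x_t, cond, t):
        # x_t:  (B, T, latent_dim) noisy audio latents at flow time t
        # cond: (B, T, cond_dim) visual lip features aligned to the audio frames
        # t:    (B,) flow time in [0, 1]
        t = t[:, None, None].expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))


def flow_matching_loss(model, x1, cond):
    """One conditional (rectified) flow-matching step: interpolate between
    Gaussian noise x0 and clean latents x1, then regress the target velocity
    (x1 - x0) at a random time t."""
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform flow time
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1
    v_target = x1 - x0
    v_pred = model(x_t, cond, t)
    return torch.mean((v_pred - v_target) ** 2)


if __name__ == "__main__":
    model = VelocityNet()
    audio_latents = torch.randn(4, 50, 64)   # placeholder audio-latent batch
    lip_features = torch.randn(4, 50, 128)   # placeholder visual-encoder outputs
    loss = flow_matching_loss(model, audio_latents, lip_features)
    loss.backward()
    print(f"flow-matching loss: {loss.item():.4f}")
```

At inference, the learned velocity field would be integrated from noise toward clean audio latents conditioned on the video, and a waveform decoder could then render those latents as speech.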
Why It Matters
This technology could revolutionize accessibility tools, silent dictation, and forensic audio reconstruction from video evidence.