SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis
This AI reads your lips from silent video and reconstructs your speech.
Researchers have introduced SLD-L2S, a new lip-to-speech synthesis model that maps visual lip movements directly to audio using a hierarchical subspace latent diffusion framework, bypassing traditional intermediate representations such as mel-spectrograms to avoid the information loss they introduce. The model employs a novel diffusion convolution block and reparameterized flow matching, and it achieves state-of-the-art generation quality on multiple benchmarks, surpassing existing methods in both objective and subjective evaluations.
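To make the flow-matching idea concrete, below is a minimal sketch of a conditional flow-matching training step in which a small network regresses the velocity between noisy and clean audio latents, conditioned on lip-motion features. The VelocityNet module, tensor shapes, and feature dimensions are illustrative assumptions and do not reproduce the SLD-L2S architecture, its diffusion convolution block, or its reparameterized variant.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the flow-matching velocity for audio latents, conditioned on
    lip-motion features and the flow time t (hypothetical stand-in model)."""
    def __init__(self, latent_dim: int = 64, cond_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, x_t, cond, t):
        # x_t:  (B, T, latent_dim) noisy audio latents at flow time t
        # cond: (B, T, cond_dim) visual lip features aligned to the audio frames
        # t:    (B,) flow time in [0, 1]
        t = t[:, None, None].expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))


def flow_matching_loss(model, x1, cond):
    """One conditional (rectified) flow-matching step: interpolate between
    Gaussian noise x0 and clean latents x1, then regress the target velocity
    (x1 - x0) at a random time t."""
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform flow time
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1
    v_target = x1 - x0
    v_pred = model(x_t, cond, t)
    return torch.mean((v_pred - v_target) ** 2)


if __name__ == "__main__":
    model = VelocityNet()
    audio_latents = torch.randn(4, 50, 64)   # placeholder audio-latent batch
    lip_features = torch.randn(4, 50, 128)   # placeholder visual-encoder outputs
    loss = flow_matching_loss(model, audio_latents, lip_features)
    loss.backward()
    print(f"flow-matching loss: {loss.item():.4f}")
```

At inference, the learned velocity field would be integrated from noise toward clean audio latents conditioned on the video, and a waveform decoder could then render those latents as speech.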
Why It Matters
This technology could revolutionize accessibility tools, silent dictation, and forensic audio reconstruction from video evidence.