Acoustic-to-Articulatory Inversion of Clean Speech Using an MRI-Trained Model
A new AI model achieves 1.56 mm accuracy in predicting tongue and lip movements from audio alone.
A team of researchers from the University of Lorraine has published a paper demonstrating that AI can accurately reconstruct the physical movements of speech—tongue, lips, and jaw positions—from ordinary, clean audio recordings. Their model, trained initially on data from real-time magnetic resonance imaging (rt-MRI), achieves a root mean square error (RMSE) of just 1.56 millimeters when predicting vocal tract shapes. This performance is remarkably close to that of models evaluated on denoised audio captured inside the MRI scanner itself, an acoustic environment that is notoriously noisy and impractical for real-world use.
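To make the headline metric concrete, here is a minimal sketch (not the authors' code) of one common way an RMSE over vocal-tract contour points can be computed; the function name and the coordinate arrays are hypothetical:

```python
import numpy as np

def contour_rmse(predicted, reference):
    """Root mean square error between predicted and reference
    vocal-tract contours, both arrays of shape (n_points, 2)
    holding (x, y) coordinates in millimeters."""
    diffs = predicted - reference
    # Euclidean error per contour point, then RMS over all points
    return np.sqrt(np.mean(np.sum(diffs ** 2, axis=1)))

# Toy example with made-up coordinates (mm)
pred = np.array([[10.0, 5.0], [12.0, 6.0], [14.0, 7.5]])
ref = np.array([[10.5, 5.2], [11.8, 6.4], [14.2, 7.0]])
print(round(contour_rmse(pred, ref), 2))  # -> 0.51
```

An RMSE of 1.56 mm thus means the predicted contour points deviate from the MRI-derived ground truth by about 1.5 mm on average, in the root-mean-square sense.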
The core innovation lies in proving that a model trained on the complex, multi-modal MRI data can be effectively applied to standard, high-quality speech recordings. The researchers compared two parallel datasets from the same speaker: one from the MRI environment (heavily corrupted by scanner noise) and one recorded in a clean acoustic studio. By aligning the phonetic segments, they validated that clean speech supports articulatory inversion just as effectively. This finding removes a major barrier to practical deployment, as it eliminates the need for expensive, cumbersome MRI equipment during inference.
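The phonetic-alignment step described above can be pictured with a small sketch, assuming each recording comes with phone-level segment boundaries (e.g. from a forced aligner). The helper below is hypothetical and not the authors' pipeline; it builds a piecewise-linear time warp from one recording to the other so that corresponding phones line up:

```python
def phone_alignment_map(phones_a, phones_b):
    """Given matched phone segment lists [(label, start, end), ...]
    from two recordings of the same utterance, return a function
    mapping a time in recording A to the corresponding time in B
    via linear interpolation within each matched phone."""
    assert len(phones_a) == len(phones_b), "segment counts must match"

    def map_time(t):
        for (la, sa, ea), (lb, sb, eb) in zip(phones_a, phones_b):
            assert la == lb, "phone sequences must match"
            if sa <= t <= ea:
                # Position within the phone in A, transferred to B
                frac = (t - sa) / (ea - sa) if ea > sa else 0.0
                return sb + frac * (eb - sb)
        raise ValueError("time outside aligned segments")

    return map_time

# Hypothetical phone boundaries (seconds) for the same utterance
mri = [("s", 0.00, 0.12), ("i", 0.12, 0.30)]
studio = [("s", 0.00, 0.10), ("i", 0.10, 0.26)]
warp = phone_alignment_map(mri, studio)
print(round(warp(0.21), 2))  # midpoint of "i" in MRI -> 0.18 in studio
```

Once the two recordings are mapped onto a common timeline like this, predictions made from the clean studio audio can be scored against articulatory contours extracted from the MRI session.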
This technology, termed acoustic-to-articulatory inversion (AAI), opens the door to numerous applications. Speech therapists could use it to provide visual biofeedback to patients with articulation disorders, showing them a real-time animation of their tongue placement. Language learners could see a model of native pronunciation. Furthermore, it provides a powerful tool for linguists studying phonetics and could enhance speech synthesis systems by grounding them in physical articulation models, leading to more natural and expressive synthetic voices.
- The model achieves 1.56 mm RMSE in predicting vocal tract shapes from clean speech, matching the performance of models evaluated on denoised in-scanner audio.
- It was trained on real-time MRI data but operates on standard audio, bypassing the need for noisy, impractical scanner recordings.
- The technique enables non-invasive, real-time visualization of articulation for speech therapy, linguistics, and improved speech synthesis.
Why It Matters
This enables real-time, visual feedback for speech therapy and language learning, and provides a physical model to improve AI-generated speech.