Acoustic-to-articulatory Inversion of the Complete Vocal Tract from RT-MRI with Various Audio Embeddings and Dataset Sizes
New technique uses RT-MRI data and HuBERT embeddings to reconstruct complete speech articulation from sound alone.
A research team from Université de Lorraine has published a paper on arXiv presenting an AI model for acoustic-to-articulatory inversion. Unlike previous methods constrained by electromagnetic articulography (EMA), which tracks only a handful of sensor points at accessible locations, their approach reconstructs the entire vocal tract geometry, from glottis to lips, using approximately 3.5 hours of real-time MRI (RT-MRI) data from a single speaker. The key innovation is the use of automatically extracted articulator contours from the MRI images rather than raw pixel data, letting the Bi-LSTM neural network focus on the essential geometric dynamics while discarding redundant information.
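To make the contour idea concrete, here is a minimal sketch of how an extracted articulator contour might be reduced to a compact, fixed-size descriptor by resampling it evenly along its arc length. This is an illustrative assumption, not the paper's actual preprocessing pipeline; the function name and the choice of 20 points are hypothetical.

```python
import numpy as np

def resample_contour(points, n=20):
    """Resample an ordered 2D contour to n points spaced evenly
    along its arc length, yielding a fixed-size descriptor."""
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)  # segment lengths
    dist = np.concatenate([[0.0], np.cumsum(seg)])         # cumulative arc length
    targets = np.linspace(0.0, dist[-1], n)                # evenly spaced targets
    x = np.interp(targets, dist, points[:, 0])
    y = np.interp(targets, dist, points[:, 1])
    return np.stack([x, y], axis=1)

# Toy contour: a quarter circle sampled at irregular intervals.
t = np.sort(np.random.default_rng(1).uniform(0, np.pi / 2, 50))
contour = np.stack([np.cos(t), np.sin(t)], axis=1)
fixed = resample_contour(contour, n=20)
print(fixed.shape)  # (20, 2): a compact articulator descriptor per frame
```

A fixed-size representation like this is what allows a sequence model to predict articulator shapes frame by frame, instead of regressing full MRI pixel grids.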
The researchers conducted two key experiments: comparing different audio embeddings (MFCCs, LCCs, and HuBERT) and varying the training-set size from 10 minutes up to the full 3.5 hours. The HuBERT embeddings proved most effective, and the model's performance scaled predictably with more data. Evaluation metrics included RMSE, median error, and specialized measurements of Tract Variables and larynx height. The resulting 1.48mm average RMSE, below the MRI's 1.62mm pixel resolution, demonstrates unprecedented accuracy in predicting vocal tract shapes from audio alone.
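The RMSE and median-error metrics above can be sketched as point-wise Euclidean distances between predicted and ground-truth contour points. The code below is a simplified illustration with simulated data, not the paper's evaluation script; the array shapes and noise level are assumptions.

```python
import numpy as np

PIXEL_MM = 1.62  # MRI pixel resolution reported in the paper (mm)

def contour_errors(pred, true):
    """RMSE and median of per-point Euclidean errors between predicted
    and ground-truth contour points (same units as the inputs)."""
    d = np.linalg.norm(pred - true, axis=-1)  # point-wise distances
    return np.sqrt(np.mean(d ** 2)), np.median(d)

rng = np.random.default_rng(0)
# Simulated data: 100 frames x 20 contour points x (x, y), in mm.
true = rng.uniform(0, 80, size=(100, 20, 2))
pred = true + rng.normal(scale=1.0, size=true.shape)  # ~1 mm coordinate noise
rmse, med = contour_errors(pred, true)
print(f"RMSE {rmse:.2f} mm, median {med:.2f} mm (pixel: {PIXEL_MM} mm)")
```

Reporting both statistics is useful because RMSE is dominated by occasional large misses, while the median reflects typical per-point accuracy.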
This complete vocal tract inversion represents a significant advancement over partial reconstruction methods. The contour-based approach reduces computational complexity while maintaining anatomical precision, and the systematic testing of embeddings provides clear guidance for future implementations. The work establishes a new benchmark for what's possible in speech science and related AI applications.
- Achieved 1.48mm RMSE accuracy—beating the MRI's 1.62mm pixel resolution—for full vocal tract reconstruction
- Used HuBERT audio embeddings with Bi-LSTM architecture on 3.5 hours of RT-MRI data from a single speaker
- Innovative contour-based approach focuses on geometric dynamics rather than raw MRI images for better efficiency
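The overall mapping, audio frame embeddings in, contour coordinates out, can be sketched with a tiny bidirectional LSTM written from scratch in NumPy. This is a toy illustration of the architecture class named in the paper, not its implementation: the dimensions, random weights, and the linear output layer are all assumptions.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    # One LSTM step: input, forget, cell, and output gates from x and h.
    z = W @ x + U @ h + b
    H = h.size
    i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
    i, f, o = 1/(1+np.exp(-i)), 1/(1+np.exp(-f)), 1/(1+np.exp(-o))
    c_new = f * c + i * np.tanh(g)
    return o * np.tanh(c_new), c_new

def bilstm(seq, params_fwd, params_bwd, H):
    """Run one LSTM forward and one backward over the sequence,
    concatenating the two hidden states at each frame."""
    def run(frames, params):
        h, c = np.zeros(H), np.zeros(H)
        out = []
        for x in frames:
            h, c = lstm_step(x, h, c, *params)
            out.append(h)
        return out
    fwd = run(seq, params_fwd)
    bwd = run(seq[::-1], params_bwd)[::-1]
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

rng = np.random.default_rng(0)
D, H, T = 16, 8, 5  # embedding dim, hidden size, frames (toy values)
make = lambda: (rng.normal(size=(4*H, D)) * 0.1,   # input weights
                rng.normal(size=(4*H, H)) * 0.1,   # recurrent weights
                np.zeros(4*H))                     # biases
seq = rng.normal(size=(T, D))            # stand-in for HuBERT frame embeddings
hidden = bilstm(seq, make(), make(), H)  # (T, 2H) bidirectional features
W_out = rng.normal(size=(2*H, 40)) * 0.1 # 40 = flattened contour coords (toy)
contours = hidden @ W_out                # per-frame contour prediction
print(contours.shape)                    # (5, 40)
```

The bidirectional pass matters for this task: coarticulation means the shape of the vocal tract at one frame depends on both past and upcoming sounds, so each frame's prediction benefits from context in both directions.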
Why It Matters
Enables more natural speech synthesis, improved speech therapy tools, and better voice-controlled interfaces by understanding the complete physics of speech production.