New pretraining method boosts speech-to-articulator mapping without SSL overhead
Researchers propose multi-target pretraining to replace heavy SSL models in low-resource speech AI.
Acoustic-to-Articulatory Inversion (AAI) is a key speech technology that estimates vocal tract movements from audio, supporting tasks like automatic speech recognition and speech synthesis. Deep learning models (CNNs, RNNs, Transformers) have advanced the task, and recent work shows that Self-Supervised Learning (SSL) features improve performance further in low-resource scenarios. However, SSL extractors add inference latency and compute overhead, which limits deployment. Bandekar and Ghosh, from the Indian Institute of Science, propose multi-target pretraining on three complementary representations: Phoneme Labels, Articulatory Feature Labels (e.g., tongue position, lip rounding), and Critical-articulator Labels (which articulator is most active). By training on these targets jointly, the model learns rich acoustic-articulatory mappings without needing an SSL backbone at inference time.
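To make "training on these targets jointly" concrete, here is a minimal sketch of one common way to combine several classification targets: a weighted sum of per-head cross-entropy losses computed over the same frames. This is an illustration, not the authors' implementation. The head names, class counts, equal weights, and the choice to treat every target as a single-label classification are all assumptions; the article does not specify the paper's loss or architecture.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy for one frame: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def joint_pretraining_loss(head_logits, head_labels, weights=None):
    """Weighted sum of the mean cross-entropy of each target head.

    head_logits: dict mapping head name -> list of per-frame logit lists
    head_labels: dict mapping head name -> list of per-frame integer labels
    weights:     optional dict of per-head weights (assumed equal by default)
    """
    weights = weights or {name: 1.0 for name in head_logits}
    total = 0.0
    for name, frames in head_logits.items():
        labels = head_labels[name]
        per_frame = [cross_entropy(l, y) for l, y in zip(frames, labels)]
        total += weights[name] * sum(per_frame) / len(per_frame)
    return total

# Toy example: two frames, three heads with hypothetical class counts.
logits = {
    "phoneme": [[2.0, 0.1, -1.0], [0.0, 0.0, 0.0]],
    "articulatory_feature": [[0.5, 0.5], [1.0, -1.0]],
    "critical_articulator": [[0.0, 3.0], [3.0, 0.0]],
}
labels = {
    "phoneme": [0, 2],
    "articulatory_feature": [1, 0],
    "critical_articulator": [1, 0],
}
loss = joint_pretraining_loss(logits, labels)
print(f"joint loss: {loss:.4f}")
```

In a setup like this, the three heads would exist only during pretraining. Afterwards they could be dropped and the shared encoder reused for the articulatory regression task, which is consistent with the article's point that no SSL extractor is needed at inference time.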
Experiments across varying data conditions show consistent improvements over a baseline without pretraining and even over SSL-augmented models, especially in low-resource setups with limited training data. The method significantly reduces inference cost while matching or exceeding the accuracy of SSL-based models. This work bridges the gap between SSL's effectiveness and practical deployment constraints, making AAI feasible for real-time applications like speech therapy feedback, silent speech interfaces, and assistive communication devices. The paper is available on arXiv, and code is expected to be released.
- Uses three target representations: Phoneme Labels, Articulatory Feature Labels, and Critical-articulator Labels during pretraining.
- Outperforms both baseline (no pretraining) and SSL-based models in low-resource scenarios while reducing inference overhead.
- Eliminates the need for a separate SSL feature extractor during inference, cutting latency and computational cost.
Why It Matters
Enables efficient, real-time vocal tract tracking from speech without heavy SSL models, benefiting speech therapy and assistive tech.