M3T: Discrete Multi-Modal Motion Tokens for Sign Language Production
Researchers combine the SMPL-X body model with FLAME's expressive face model to capture grammatically essential eyebrow raises and mouth movements.
A research team from the University of Surrey and other institutions has developed M3T, a novel AI system that significantly advances the quality of automated sign language production. The core innovation is SMPL-FX, a new 3D body model that couples the SMPL-X skeleton with the FLAME model's rich facial expression space. This addresses a critical limitation of standard body models, whose facial representation is too low-dimensional to encode grammatically obligatory non-manual features (NMFs) such as mouth shapes, eyebrow raises, and head movements.
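As a rough illustration of what such a hybrid parameterization might look like, here is a hypothetical per-frame layout. The names and dimensions are assumptions (SMPL-X-style body and hand rotations, a FLAME-sized expression vector), not the paper's actual SMPL-FX definition:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SMPLFXFrame:
    """Illustrative parameter layout for an SMPL-FX-style frame.
    Dimensions are assumptions: SMPL-X conventions for body and hands,
    with a FLAME-sized expression vector replacing SMPL-X's much
    smaller native expression space."""
    body_pose: np.ndarray    # (21, 3) axis-angle body joint rotations
    hand_pose: np.ndarray    # (2, 15, 3) left/right finger joint rotations
    jaw_pose: np.ndarray     # (3,) jaw rotation, drives mouth openness
    expression: np.ndarray   # (100,) FLAME-style expression coefficients

    def split_by_modality(self):
        """Group parameters the way modality-specific tokenizers would
        consume them: one flat feature vector per stream."""
        return {
            "body": self.body_pose.reshape(-1),
            "hands": self.hand_pose.reshape(-1),
            "face": np.concatenate([self.jaw_pose, self.expression]),
        }
```

The point of the richer face vector is headroom: a handful of expression coefficients cannot distinguish the mouth shapes and brow positions that carry grammatical meaning in signing.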
M3T tokenizes this multi-modal representation using separate Finite Scalar Quantization (FSQ) VAEs for the body, hands, and face, preventing the 'codebook collapse' that plagued previous methods. Generation is handled by an autoregressive transformer trained on this discrete vocabulary, with an auxiliary translation objective that grounds the token embeddings in semantics. On the NMFs-CSL benchmark, where signs are distinguishable only by these facial and head cues, M3T achieved 58.3% accuracy, a substantial 9.3-point jump over the strongest prior pose-based model at 49.0%. It also set new state-of-the-art results on three major public benchmarks: How2Sign, CSL-Daily, and Phoenix14T.
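To make the tokenization step concrete, here is a minimal FSQ sketch in PyTorch. It is not the paper's code: the level counts, latent width, and the VAE encoder/decoder around it are assumptions, but the rounding-with-straight-through mechanism is the core of FSQ as introduced by Mentzer et al.:

```python
import torch
import torch.nn as nn

class FSQ(nn.Module):
    """Minimal Finite Scalar Quantization. Each latent dimension is
    bounded and rounded to a small, fixed set of integer levels, so the
    codebook is implicit (the product of per-dimension levels) and has
    no learned embeddings that can go unused -- the failure mode behind
    codebook collapse in vanilla VQ. For simplicity this sketch assumes
    odd level counts; the FSQ paper also handles even counts via a
    half-integer shift."""

    def __init__(self, levels=(7, 5, 5, 5)):  # 7*5*5*5 = 875 codes
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels, dtype=torch.float32))

    def forward(self, z):
        # Bound each dimension to (-(L-1)/2, (L-1)/2) with tanh, then round.
        half = (self.levels - 1) / 2
        bounded = torch.tanh(z) * half
        quantized = torch.round(bounded)
        # Straight-through estimator: rounded values on the forward pass,
        # identity gradient on the backward pass.
        return bounded + (quantized - bounded).detach()

    def codes_to_indices(self, codes):
        # Flatten per-dimension integer codes into one token id
        # via a mixed-radix positional encoding.
        half = (self.levels - 1) / 2
        digits = (codes + half).long()  # shift [-half, half] -> [0, L-1]
        basis = torch.cumprod(
            torch.cat([torch.ones(1, device=digits.device),
                       self.levels[:-1]]), dim=0).long()
        return (digits * basis).sum(dim=-1)
```

Because every code in the implicit grid is reachable by simple rounding, there is no learned codebook whose entries can drift out of use, which is why FSQ sidesteps the collapse that hurts standard VQ-VAEs.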
This work moves beyond simple hand motion generation to address the full linguistic complexity of sign languages. By properly modeling the multi-channel nature of signing—where face, gaze, and body posture carry essential grammatical and emotional information—M3T represents a major step toward creating more natural and understandable AI sign language avatars for accessibility applications.
- Introduces SMPL-FX, a new 3D body model combining SMPL-X with FLAME's facial expression parameters to capture non-manual features.
- Uses modality-specific tokenization with Finite Scalar Quantization VAEs to avoid codebook collapse and represent the full expression space.
- Achieves 58.3% accuracy on the NMFs-CSL benchmark (vs. the 49.0% prior best) and state-of-the-art results on three other major sign language datasets.
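To show how the discrete streams come together at inference time, the sketch below outlines one plausible autoregressive decoding loop over interleaved body/hand/face tokens, matching the transformer described above. The model interface, stream order, and greedy decoding are illustrative assumptions, not M3T's actual design:

```python
import torch

# Hypothetical stream layout; vocabulary sizes would match each
# modality's FSQ grid (e.g. 875 codes for levels 7*5*5*5).
STREAMS = ("body", "hands", "face")

@torch.no_grad()
def generate(model, text_emb, num_frames):
    """Decode one token per modality per frame, conditioned on the
    source-text embedding and all previously emitted tokens. Greedy
    argmax is used here for simplicity; sampling is equally plausible."""
    tokens = []  # flat history of (stream, token_id) pairs
    for _ in range(num_frames):
        for stream in STREAMS:
            # Assumed interface: the model returns logits over the
            # current stream's vocabulary given text and token history.
            logits = model(text_emb, tokens, stream)
            next_id = int(torch.argmax(logits, dim=-1))
            tokens.append((stream, next_id))
    # Regroup into per-frame dicts for the modality-specific FSQ decoders.
    n = len(STREAMS)
    return [dict(tokens[i:i + n]) for i in range(0, len(tokens), n)]
```

Each per-frame dict would then be mapped back to continuous SMPL-FX parameters by the corresponding modality's decoder.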
Why It Matters
Enables more accurate and natural AI sign language avatars by capturing the full grammar of facial expressions and body language, not just hand shapes.