Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI-Dubbed Content?
A new hierarchical model fuses audio, video, and text to automatically evaluate AI-dubbed content quality.
A team of researchers has developed a novel AI architecture that automatically evaluates the quality of AI-generated dubbed content, a task traditionally reliant on expensive and slow human scoring. The model, presented in a paper accepted at ICASSP 2026, uses a hierarchical approach to fuse complementary signals from audio (speaker identity, prosody), video (facial expressions, scene cues), and text (semantic context). To overcome the scarcity of human-labeled data, the team generated proxy quality scores (proxy MOS) by aggregating objective metrics, with the aggregation weights optimized via active learning, before fine-tuning on actual human ratings.
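The paper's exact metric set and weighting scheme are not spelled out here, but the proxy-label idea can be sketched in a few lines. The metric names, the uncertainty heuristic, and the least-squares refit below are illustrative assumptions standing in for the paper's active-learning optimization, and the "human" ratings are simulated:

```python
import numpy as np

# Hypothetical objective metrics per clip (names are illustrative, not from the paper):
# lip-sync score, ASR-based intelligibility, prosody similarity, semantic similarity.
rng = np.random.default_rng(0)
metrics = rng.uniform(0.0, 1.0, size=(1000, 4))  # 1000 unlabeled clips x 4 metrics

def proxy_mos(metrics: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted aggregation of objective metrics into a proxy MOS on the 1-5 scale."""
    score = metrics @ weights                      # linear combination in [0, 1]
    return 1.0 + 4.0 * np.clip(score, 0.0, 1.0)   # map to MOS range 1-5

# Simplified active-learning loop: repeatedly query human MOS for the clips whose
# current proxy score is least certain, then refit the aggregation weights.
weights = np.full(4, 0.25)          # start from a uniform aggregation
labeled_idx: list[int] = []
for _ in range(5):                  # 5 query rounds
    preds = proxy_mos(metrics, weights)
    uncertainty = -np.abs(preds - 3.0)            # crude proxy: scores near the midpoint
    candidates = np.argsort(uncertainty)[::-1]    # most uncertain first
    new = [i for i in candidates if i not in labeled_idx][:20]
    labeled_idx.extend(new)
    # Stand-in for annotator ratings; in practice these come from human raters.
    human_mos = 1.0 + 4.0 * metrics[labeled_idx].mean(axis=1)
    target = (human_mos - 1.0) / 4.0              # back to [0, 1] for the linear fit
    weights, *_ = np.linalg.lstsq(metrics[labeled_idx], target, rcond=None)

print("learned aggregation weights:", np.round(weights, 3))
```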
Trained on a dataset of 12,000 bidirectional Hindi-English dubbed clips, the architecture employs parameter-efficient LoRA adapters to fine-tune each modality. Features are fused progressively through intra-modal and then inter-modal layers, capturing the multi-dimensional aspects of dubbing quality such as synchronization, intelligibility, and emotional alignment. The resulting model achieves a Pearson Correlation Coefficient (PCC) greater than 0.75 against human ratings, demonstrating strong perceptual alignment. This gives developers and platforms a practical, scalable way to rapidly and consistently assess AI-dubbing output during model training and content production cycles.
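How such a hierarchy might be wired together can be sketched in PyTorch. Everything below is an assumption for illustration: the feature dimensions, layer counts, the simplified LoRA module, and the attention-based inter-modal fusion are not the paper's actual design, just a minimal instance of the pattern it describes (per-modality projection, intra-modal refinement with LoRA, inter-modal fusion, MOS regression head):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (a simplified LoRA)."""
    def __init__(self, dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)        # pretrained weights stay frozen
        self.base.bias.requires_grad_(False)
        self.down = nn.Linear(dim, rank, bias=False)  # A: dim -> rank
        self.up = nn.Linear(rank, dim, bias=False)    # B: rank -> dim
        nn.init.zeros_(self.up.weight)                # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

class HierarchicalFusionMOS(nn.Module):
    """Sketch: modality encoders -> intra-modal fusion -> inter-modal fusion -> MOS head."""
    def __init__(self, d: int = 256):
        super().__init__()
        # Projections from (assumed) pretrained feature extractors into a shared width.
        self.proj = nn.ModuleDict({
            "audio": nn.Linear(768, d),   # e.g., speech/prosody embeddings
            "video": nn.Linear(512, d),   # e.g., face/scene embeddings
            "text":  nn.Linear(768, d),   # e.g., sentence embeddings
        })
        # Intra-modal layers refine each stream; LoRA keeps tuning parameter-efficient.
        self.intra = nn.ModuleDict({m: LoRALinear(d) for m in self.proj})
        # Inter-modal fusion: self-attention over the three modality tokens.
        self.fuse = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, audio, video, text):
        tokens = []
        for name, feats in (("audio", audio), ("video", video), ("text", text)):
            h = torch.relu(self.proj[name](feats))
            tokens.append(self.intra[name](h))
        fused = self.fuse(torch.stack(tokens, dim=1))  # (batch, 3 modalities, d)
        pooled = fused.mean(dim=1)
        return 1.0 + 4.0 * torch.sigmoid(self.head(pooled)).squeeze(-1)  # MOS in [1, 5]

model = HierarchicalFusionMOS()
mos = model(torch.randn(2, 768), torch.randn(2, 512), torch.randn(2, 768))
print(mos.shape)  # torch.Size([2])
```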
- Uses hierarchical fusion of audio, video, and text data to predict human Mean Opinion Scores (MOS) for AI dubs.
- Trained on 12,000 bidirectional Hindi-English clips and achieves strong perceptual alignment (PCC > 0.75; a minimal evaluation check is sketched after this list).
- Employs LoRA adapters for efficient tuning and creates proxy labels via active learning to overcome data scarcity.
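For context on the headline number, this is how a PCC check against held-out human MOS is typically computed; the arrays below are placeholders, not the paper's data:

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholders: in practice, model predictions vs. held-out human MOS ratings.
predicted = np.array([3.8, 2.1, 4.5, 3.0, 1.9, 4.2])
human     = np.array([4.0, 2.5, 4.4, 3.2, 1.7, 4.0])

pcc, p_value = pearsonr(predicted, human)
print(f"PCC = {pcc:.3f} (p = {p_value:.3g})")  # values > 0.75 indicate strong alignment
```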
Why It Matters
Enables scalable, automated quality control for AI-generated voiceovers and dubbing, accelerating content localization.