Audio & Speech

Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI Dubbed Content?

A new hierarchical model fuses audio, video, and text to automatically evaluate AI-dubbed content quality.

Deep Dive

A team of researchers has developed an AI architecture that automatically evaluates the quality of AI-generated dubbed content, a task that has traditionally relied on slow and expensive human scoring. The model, presented in a paper accepted at ICASSP 2026, uses a hierarchical approach to fuse complementary data from audio (speaker identity, prosody), video (facial expressions, scene cues), and text (semantic context). To overcome the scarcity of human-labeled data, the team first built proxy Mean Opinion Scores (proxy MOS) by aggregating objective metrics, with the aggregation weights optimized via active learning, and then fine-tuned the model on actual human ratings.
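The proxy-MOS idea can be illustrated with a minimal sketch. The metric names, the weights, and the mapping onto a 1-to-5 scale below are assumptions chosen for illustration; the summary does not state which objective metrics the paper aggregates or how active learning sets the weights, so the weights here are simply fixed by hand.

```python
def proxy_mos(metrics, weights):
    """Aggregate objective metrics into a proxy MOS (illustrative only).

    metrics: dict of metric name -> value normalized to [0, 1]
    weights: dict of metric name -> non-negative weight (hypothetical values;
             in the paper these are optimized via active learning)
    Returns a score on an assumed 1-to-5 MOS scale.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    for name in weights:
        if not 0.0 <= metrics[name] <= 1.0:
            raise ValueError(f"metric {name!r} must be normalized to [0, 1]")
    # Weighted average of the normalized metrics, still in [0, 1].
    combined = sum(weights[n] * metrics[n] for n in weights) / total
    # Map [0, 1] linearly onto the 1-to-5 range commonly used for MOS.
    return 1.0 + 4.0 * combined


# Hypothetical metric names and weights, not taken from the paper.
clip_metrics = {"lip_sync": 0.8, "intelligibility": 0.9, "speaker_similarity": 0.7}
clip_weights = {"lip_sync": 0.5, "intelligibility": 0.3, "speaker_similarity": 0.2}
print(round(proxy_mos(clip_metrics, clip_weights), 2))  # about 4.24
```

Labels produced this way are cheap to generate for every clip, which is what makes them useful for pre-training before the smaller set of human ratings is used for fine-tuning.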

Trained on a dataset of 12,000 clips dubbed in both directions between Hindi and English, the architecture employs parameter-efficient LoRA adapters for fine-tuning across modalities. The system fuses features progressively, first within each modality (intra-modal layers) and then across modalities (inter-modal layers), to capture the multiple dimensions of dubbing quality, such as synchronization, intelligibility, and emotional alignment. The resulting model achieves a Pearson Correlation Coefficient (PCC) greater than 0.75 with human perception. This gives developers and platforms a practical, scalable way to assess AI-dubbing output quickly and consistently during model training and content production cycles.
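For readers unfamiliar with the reported metric, the sketch below shows how a Pearson Correlation Coefficient between predicted scores and human ratings is computed. The function and the sample numbers are illustrative assumptions, not the paper's evaluation code or data.

```python
import math


def pearson(predicted, human):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(predicted)
    if n != len(human) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_p = sum(predicted) / n
    mean_h = sum(human) / n
    # Covariance and variances (the 1/n factors cancel in the ratio).
    cov = sum((p - mean_p) * (h - mean_h) for p, h in zip(predicted, human))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_h = sum((h - mean_h) ** 2 for h in human)
    if var_p == 0 or var_h == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_h)


# Made-up scores for illustration only.
model_scores = [3.1, 4.2, 2.5, 4.8, 3.9]
human_scores = [3.0, 4.5, 2.0, 4.6, 3.5]
print(pearson(model_scores, human_scores))
```

A PCC of 1.0 would mean the model's scores rise and fall in exact linear step with the human ratings, and 0 would mean no linear relationship, so a value above 0.75 indicates that the predictions track human judgments closely but not perfectly.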

Key Points
  • Uses hierarchical fusion of audio, video, and text data to predict human Mean Opinion Scores (MOS) for AI dubs.
  • Trained on 12,000 Hindi-English clips and achieves high perceptual alignment (PCC > 0.75).
  • Employs LoRA adapters for efficient tuning and builds proxy MOS labels from objective metrics, with weights optimized via active learning, to overcome data scarcity.

Why It Matters

Enables scalable, automated quality control for AI-generated voiceovers and dubbing, accelerating content localization.