Research & Papers

Solution for 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition Challenge using Divergence-Based Multimodal Fusion

A new multimodal AI model detects ambivalence by measuring the conflict between what people say, how they sound, and what their faces show.

Deep Dive

A research team from Brazil has published a novel solution for the 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition, part of the ABAW workshop at CVPR 2026. Their model introduces a 'divergence-based multimodal fusion' technique that does more than combine data from video, audio, and text: it actively measures the conflict between the modalities. The system extracts visual cues via facial Action Units (AUs) using Py-Feat, processes audio with Wav2Vec 2.0, and analyzes text with BERT. Each modality is encoded by a BiLSTM network with attention pooling and projected into a shared embedding space, where cross-modal incongruence is computed.
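Based on that description, the per-modality encoder can be sketched in a few lines of PyTorch. This is a minimal illustration assuming pre-extracted frame- or token-level features as input; the class name `ModalityEncoder` and the layer sizes are chosen here for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """BiLSTM encoder with attention pooling that maps a sequence of
    pre-extracted features (AU intensities, Wav2Vec 2.0 frames, or BERT
    token embeddings) into a shared embedding space."""

    def __init__(self, in_dim: int, hidden_dim: int = 128, shared_dim: int = 256):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)           # scores each timestep
        self.proj = nn.Linear(2 * hidden_dim, shared_dim)  # projects into the shared space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim)
        h, _ = self.bilstm(x)                         # (batch, time, 2*hidden_dim)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over timesteps
        pooled = (weights * h).sum(dim=1)             # (batch, 2*hidden_dim)
        return self.proj(pooled)                      # (batch, shared_dim)
```

One encoder instance per modality (with the appropriate input dimension) would produce the three embeddings that the fusion step compares.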

On the benchmark BAH (Bodily Ambivalence and Hesitancy) dataset, this approach achieved a Macro F1 score of 0.6808 on the validation set, compared with the challenge baseline of 0.2827, an improvement of over 140%. Statistical analysis across 1,132 videos confirmed that the temporal variability of facial Action Units is the strongest visual indicator of ambivalence. The core innovation is the fusion module, which computes pairwise absolute differences between the modality embeddings. These difference terms directly capture the misalignment between what a person's face shows, what their voice conveys, and what their words say, which is the hallmark of hesitant or conflicted behavior.
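The fusion step can likewise be sketched briefly. The pairwise absolute differences are the part stated in the summary; concatenating them with the original embeddings and the classifier head sizes are assumptions made for this illustration.

```python
import torch
import torch.nn as nn

class DivergenceFusion(nn.Module):
    """Fuses three modality embeddings by explicitly modelling their conflict:
    pairwise absolute differences quantify how far the face, voice, and words
    disagree in the shared space. Concatenating the differences with the
    embeddings themselves (an assumption, not specified in the summary)
    gives the classifier both content and conflict signals."""

    def __init__(self, shared_dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(6 * shared_dim, 128),  # 3 embeddings + 3 difference terms
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, v, a, t):
        # v, a, t: (batch, shared_dim) visual / audio / text embeddings
        d_va = torch.abs(v - a)  # face vs. voice divergence
        d_vt = torch.abs(v - t)  # face vs. words divergence
        d_at = torch.abs(a - t)  # voice vs. words divergence
        fused = torch.cat([v, a, t, d_va, d_vt, d_at], dim=-1)
        return self.classifier(fused)
```

In use, the three outputs of the `ModalityEncoder` sketch above would be passed as `v`, `a`, and `t`.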

Key Points
  • Achieved a Macro F1 score of 0.6808, outperforming the challenge baseline (0.2827) by over 140%.
  • Uses a novel 'divergence-based fusion' that measures conflict between visual (Action Units), audio (Wav2Vec 2.0), and text (BERT) signals.
  • Analysis of 1,132 videos found temporal changes in facial Action Units to be the key visual discriminator for hesitation (a small sketch of one such variability statistic follows this list).
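To make the Action Unit finding concrete, one simple measure of temporal variability is the per-AU standard deviation across a video's frames. The sketch below assumes `au_frames` is a frames-by-AUs array of intensities (for example, the AU columns of a Py-Feat detection result); the exact statistic used in the paper's analysis is not specified here.

```python
import numpy as np

def au_variability(au_frames: np.ndarray) -> np.ndarray:
    """Per-AU temporal variability for one video.

    au_frames: (num_frames, num_aus) array of Action Unit intensities.
    Returns one standard deviation per AU; higher values mean the face
    changed more over time, the cue reported as most discriminative.
    """
    return au_frames.std(axis=0)
```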

Why It Matters

Enables more nuanced AI for mental health screening, customer service analysis, and human-computer interaction by detecting subtle emotional conflicts.