Research & Papers

Foundation Model Embeddings Meet Blended Emotions: A Multimodal Fusion Approach for the BLEMORE Challenge

A new multimodal system combines Gemini Embedding 2.0 with frozen Wav2Vec2 layers to recognize blended human emotions and estimate their salience.

Deep Dive

A research team from the University of Zurich and ETH Zurich has unveiled a multimodal AI system designed for the BLEMORE Challenge, a competition focused on recognizing blended human emotions and predicting their relative salience. Their approach, detailed in a new arXiv paper, combines six distinct families of encoders through late probability fusion: each encoder predicts class probabilities independently, and the outputs are merged as a weighted average, as sketched below. The ensemble includes a specialized S4D-ViTMoE face encoder, strategically frozen layers from the Wav2Vec2 audio model, fine-tuned body-language models such as TimeSformer and VideoMAE, and, in a first for the task, Google's Gemini Embedding 2.0 applied to video input.
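
To make the fusion step concrete, here is a minimal sketch of weighted late probability fusion, assuming each encoder emits a softmax distribution over the emotion classes; the emotion labels, encoder names, and weights are illustrative placeholders, not values from the paper.

```python
import numpy as np

# Hypothetical emotion classes; the challenge's actual label set may differ.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def late_probability_fusion(probs_per_encoder, weights):
    """Weighted average of per-encoder class probabilities.

    probs_per_encoder: (n_encoders, n_classes) array of softmax outputs.
    weights: (n_encoders,) non-negative fusion weights.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the fused scores remain a distribution
    return w @ np.asarray(probs_per_encoder)

# Example with three illustrative encoders and made-up weights.
probs = np.array([
    [0.60, 0.05, 0.05, 0.10, 0.15, 0.05],  # face encoder
    [0.40, 0.10, 0.10, 0.10, 0.20, 0.10],  # prosody encoder
    [0.30, 0.10, 0.10, 0.20, 0.20, 0.10],  # body-language encoder
])
fused = late_probability_fusion(probs, weights=[0.5, 0.3, 0.2])
print(dict(zip(EMOTIONS, fused.round(3))))
```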

The experiments yielded two notable findings. First, selectively using layers 6-12 of a frozen Wav2Vec2 model for prosody encoding significantly outperformed end-to-end fine-tuning, likely because phonetic information is irrelevant to the non-verbal audio cues in the challenge. Second, while the full 12-encoder fusion achieved a final score of 0.279 on the test set, securing 6th place, the optimal threshold for determining emotion salience varied widely (0.05 to 0.43) across data folds, indicating that highly personalized expression styles remain the primary bottleneck for accurate, generalized emotion AI.
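
As a rough illustration of the threshold issue, the sketch below sweeps a presence threshold on one fold's validation data; the function name, the evaluation grid, and the use of plain presence accuracy as the selection criterion are assumptions, not details from the paper.

```python
import numpy as np

def best_presence_threshold(probs, labels, grid=np.linspace(0.0, 1.0, 101)):
    """Sweep a decision threshold and keep the one maximizing presence accuracy.

    probs:  (n_samples, n_classes) fused per-emotion salience probabilities.
    labels: (n_samples, n_classes) binary ground-truth presence mask.
    """
    best_t, best_acc = 0.5, -1.0
    for t in grid:
        acc = ((probs >= t) == labels.astype(bool)).mean()
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Tuned independently per cross-validation fold (folds are hypothetical here):
# thresholds = [best_presence_threshold(p, y)[0] for p, y in folds]
```

Running this search per fold is exactly the setting in which the reported 0.05-0.43 spread would surface: each fold's speakers express emotions differently, so no single threshold transfers.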

Key Points
  • First use of Google's Gemini Embedding 2.0 for video in emotion recognition, achieving a presence accuracy (ACCP) of 0.320 from just 2-second clips.
  • Frozen Wav2Vec2 layers (6-12) for prosody beat fine-tuned models, scoring 0.207 vs. 0.161, suggesting task-specific layer selection is key (see the sketch after this list).
  • The system's 12-encoder fusion gave 62% weight to task-adapted models over general baselines, achieving a final score of 0.279 for 6th place.
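
The frozen-layer finding is easy to picture in code. Below is a minimal sketch of extracting prosody features from layers 6-12 of a frozen Wav2Vec2 via the Hugging Face transformers API; the checkpoint name and the mean-pooling over layers and time are assumptions, as the paper may aggregate layers differently.

```python
import torch
from transformers import Wav2Vec2Model

# Load a base Wav2Vec2 (the paper's exact checkpoint is not stated here;
# facebook/wav2vec2-base is an assumption) and freeze it entirely.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()
for p in model.parameters():
    p.requires_grad = False

@torch.no_grad()
def prosody_features(waveform_16k: torch.Tensor) -> torch.Tensor:
    """Mean-pool hidden states from transformer layers 6-12.

    waveform_16k: (batch, samples) mono audio at 16 kHz.
    Returns (batch, hidden_size) features for a downstream classifier.
    """
    out = model(waveform_16k, output_hidden_states=True)
    # hidden_states[0] is the CNN feature projection; [1..12] are the
    # transformer layers, so layers 6-12 are indices 6 through 12.
    layers = torch.stack(out.hidden_states[6:13])   # (7, batch, time, dim)
    return layers.mean(dim=(0, 2))                  # average layers and time

# Example: two seconds of (random) 16 kHz audio.
feats = prosody_features(torch.randn(1, 32000))
print(feats.shape)  # torch.Size([1, 768]) for the base checkpoint
```

Because the backbone stays frozen, only a lightweight classifier on top of these features needs training, which is consistent with the reported advantage over end-to-end fine-tuning.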

Why It Matters

Advances multimodal AI's ability to interpret complex, real-world human emotional states, crucial for improving human-computer interaction and mental health tools.