Research & Papers

Automated Detection of Mutual Gaze and Joint Attention in Dual-Camera Settings via Dual-Stream Transformers

Uses frozen GazeLLE backbones and token fusion to outperform multimodal LLMs.

Deep Dive

Researchers Jakub Kosmydel, Paweł Gajewski, and Arkadiusz Białek have introduced a dual-stream Transformer architecture that automates the detection of mutual gaze (MG) and joint attention (JA) from synchronized dual-camera recordings. These behavioral cues are central to developmental psychology, but coding them manually is labor-intensive and error-prone. The model addresses the core difficulty of relating gaze cues across cameras: frozen GazeLLE backbones (pretrained gaze-aware encoders) extract rich visual features from each view, and a custom token fusion mechanism then captures the spatial and semantic relationships between the two partners in a dyad.
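The paper's exact architecture isn't reproduced here, but the general pattern is easy to sketch. The PyTorch snippet below is a minimal, hypothetical illustration: DummyEncoder stands in for a frozen GazeLLE backbone (whose real interface differs), and a small Transformer encoder plays the role of the token fusion stage, mixing camera-tagged tokens from both views before a classification head predicts mutual gaze or joint attention.

    import torch
    import torch.nn as nn

    class DummyEncoder(nn.Module):
        """Stand-in for a frozen GazeLLE backbone: maps frames to patch tokens."""
        def __init__(self, feat_dim: int = 256, patch: int = 16):
            super().__init__()
            self.proj = nn.Conv2d(3, feat_dim, kernel_size=patch, stride=patch)

        def forward(self, x):                       # x: (B, 3, H, W)
            t = self.proj(x)                        # (B, D, H/16, W/16)
            return t.flatten(2).transpose(1, 2)     # (B, N, D) token sequence

    class DualStreamGazeModel(nn.Module):
        """Two camera streams -> shared frozen encoder -> token fusion -> label."""
        def __init__(self, backbone: nn.Module, feat_dim: int = 256,
                     num_classes: int = 2):
            super().__init__()
            self.backbone = backbone
            for p in self.backbone.parameters():    # freeze the gaze encoder
                p.requires_grad = False
            # Learned per-camera embeddings so the fusion stage can tell
            # which view each token came from.
            self.cam_embed = nn.Parameter(torch.zeros(2, 1, feat_dim))
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                               batch_first=True)
            self.fusion = nn.TransformerEncoder(layer, num_layers=2)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
            self.head = nn.Linear(feat_dim, num_classes)  # MG / JA logits

        def forward(self, view_a, view_b):
            with torch.no_grad():                   # backbones stay frozen
                tok_a = self.backbone(view_a)       # (B, N, D)
                tok_b = self.backbone(view_b)
            tok_a = tok_a + self.cam_embed[0]       # tag tokens by camera
            tok_b = tok_b + self.cam_embed[1]
            cls = self.cls_token.expand(view_a.size(0), -1, -1)
            # Self-attention over the joint token set captures cross-camera
            # spatial/semantic relations; classify from the [CLS] token.
            fused = self.fusion(torch.cat([cls, tok_a, tok_b], dim=1))
            return self.head(fused[:, 0])

    model = DualStreamGazeModel(DummyEncoder())
    logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))

Tagging tokens with a per-camera embedding before joint self-attention is one simple way to let the fusion stage reason about relations across views; the authors' actual fusion mechanism may be more elaborate.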

The approach was evaluated on an ecologically valid dataset of caregiver-infant interactions, where it significantly outperformed both a convolutional baseline and a state-of-the-art multimodal large language model (LLM), demonstrating the effectiveness of a specialized architecture over general-purpose AI for this task. By open-sourcing the model and pretrained weights, the team gives behavioral scientists a scalable tool that can be fine-tuned to diverse laboratory environments, bridging computational modeling and applied interaction research.
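Because the backbones are frozen, adapting the released weights to a new lab only requires training the fusion stage and head. Continuing the hypothetical sketch above (the project's actual training script will differ), a fine-tuning step could look like:

    # Only fusion/head parameters require grad; the frozen backbone is skipped.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # One illustrative step on a (view_a, view_b, label) batch of new lab data.
    view_a = torch.randn(4, 3, 224, 224)
    view_b = torch.randn(4, 3, 224, 224)
    labels = torch.randint(0, 2, (4,))          # e.g. mutual gaze yes/no
    optimizer.zero_grad()
    loss = loss_fn(model(view_a, view_b), labels)
    loss.backward()
    optimizer.step()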

Key Points
  • Uses frozen GazeLLE backbones for gaze-aware visual encoding, avoiding full fine-tuning.
  • Custom token fusion mechanism integrates spatial and semantic cues across dual-camera feeds.
  • Open-sourced model outperforms a convolutional baseline and a state-of-the-art multimodal LLM on caregiver-infant interaction data.

Why It Matters

Automates labor-intensive gaze coding, accelerating developmental psychology research in multi-camera lab setups.