Gemini 3 fails to beat classical ML for attention detection in educational videos
Despite advanced reasoning, VLMs struggle with real-time attention detection from eye-tracking data.
A new study from Sorbonne University and CNRS explored whether Vision-Language Models (VLMs) can detect attention loss in educational videos by analyzing gaze patterns. Using an eye-tracking dataset of 70 learners watching instructional content, the researchers combined gaze data with video frames and fed both to Google's Gemini 3 VLM under multiple prompting strategies. The goal was to use the model's semantic reasoning to contextualize where a learner was looking, in effect replacing handcrafted feature engineering with foundation-model understanding.
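To make the idea of combining gaze data with video frames concrete, here is a minimal sketch of one possible alignment step: grouping gaze samples by the frame on screen at each timestamp and averaging them into a single pixel position per frame. The summary does not describe the study's actual pipeline, so the function names, the normalized-coordinate input format, and the averaging rule are all assumptions made for illustration.

```python
def frame_index(t, fps):
    """Index of the video frame on screen at time t (seconds)."""
    return int(t * fps)


def gaze_per_frame(samples, fps, width, height):
    """Average the gaze samples that fall on each frame (hypothetical helper).

    samples: list of (t_seconds, x_norm, y_norm), with x and y in [0, 1].
    Returns {frame_index: (x_px, y_px)} in pixel coordinates.
    """
    buckets = {}
    for t, x, y in samples:
        buckets.setdefault(frame_index(t, fps), []).append((x, y))

    positions = {}
    for idx, points in buckets.items():
        mean_x = sum(p[0] for p in points) / len(points)
        mean_y = sum(p[1] for p in points) / len(points)
        # Scale normalized coordinates to the frame's pixel grid.
        positions[idx] = (round(mean_x * (width - 1)), round(mean_y * (height - 1)))
    return positions
```

A per-frame position like this could then be drawn onto the frame as a marker or described in the prompt text before the frame is sent to a VLM; which of these the researchers did is not stated here.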
However, none of the VLM-based approaches matched classical machine learning classifiers trained on engineered features such as fixation statistics and saccade properties. The negative result points to a gap: VLMs reason well about static images, but in this setting they did not capture the temporal dynamics and subtle patterns of human attention over time. The authors conclude that current VLMs are not yet suitable for real-time educational diagnostics, reinforcing the value of traditional time-series models for this task.
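For readers unfamiliar with engineered gaze features, the sketch below shows roughly what "fixation statistics and saccade properties" can look like in code. It uses a simple velocity-threshold rule to split a gaze trace into fixations and saccades, then summarizes them. This is an illustrative assumption, not the study's feature set: the threshold value, the units (normalized screen units per second), and the function name `gaze_features` are all hypothetical.

```python
import math
from statistics import mean


def gaze_features(samples, velocity_threshold=1.0):
    """Summarize a gaze trace with a velocity-threshold split (illustrative).

    samples: list of (t_seconds, x, y) in normalized screen units.
    velocity_threshold: speed (units/second) at or above which movement
    counts as a saccade; the default is an assumed value, not a standard.
    """
    fixations = []  # fixation durations in seconds
    saccades = []   # saccade amplitudes in screen units
    fix_start = fix_end = None
    sac_amp, in_saccade = 0.0, False

    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        dist = math.hypot(x1 - x0, y1 - y0)
        if dist / dt < velocity_threshold:
            # Slow movement: close any open saccade, extend the fixation.
            if in_saccade:
                saccades.append(sac_amp)
                sac_amp, in_saccade = 0.0, False
            if fix_start is None:
                fix_start = t0
            fix_end = t1
        else:
            # Fast movement: close any open fixation, extend the saccade.
            if fix_start is not None:
                fixations.append(fix_end - fix_start)
                fix_start = None
            sac_amp += dist
            in_saccade = True

    # Flush whichever event is still open at the end of the trace.
    if fix_start is not None:
        fixations.append(fix_end - fix_start)
    if in_saccade:
        saccades.append(sac_amp)

    return {
        "fixation_count": len(fixations),
        "mean_fixation_duration": mean(fixations) if fixations else 0.0,
        "saccade_count": len(saccades),
        "mean_saccade_amplitude": mean(saccades) if saccades else 0.0,
    }
```

Feature dictionaries like this, computed per time window, are the kind of input a classical classifier would be trained on. The point of the comparison in the study is that such compact temporal summaries outperformed the VLM prompting approaches.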
- Researchers tested Gemini 3 on eye-tracking data from 70 learners watching educational videos
- No prompting strategy (zero-shot, few-shot, chain-of-thought) beat classical ML baselines
- VLMs struggle with temporal attention patterns despite strong semantic reasoning
Why It Matters
The study suggests that even advanced VLMs cannot yet replace engineered features for real-time attention monitoring in education.