Gemini 3 fails to beat classical ML for attention detection in educational videos

arXiv cs.CV May 21, 2026

⚡Despite advanced reasoning, VLMs struggle with real-time attention detection from eye-tracking data.

Deep Dive

A new study from Sorbonne University and CNRS explored whether Vision-Language Models (VLMs) can detect attention loss in educational videos by analyzing gaze patterns. Using an eye-tracking dataset of 70 learners watching instructional content, the researchers combined gaze data with video frames and fed them to Google's Gemini 3 VLM under multiple prompting strategies (zero-shot, few-shot, and chain-of-thought). The goal was to use the model's semantic reasoning to contextualize where a learner was looking, in effect replacing handcrafted feature engineering with foundation-model understanding.
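To make the three prompting strategies concrete, here is a minimal Python sketch of how a gaze trace might be turned into a text prompt. It is an illustration only: the task wording, the `build_prompt` and `describe_gaze` helpers, the label names, and the choice to pass gaze as normalized coordinates in text are all assumptions, not the study's actual prompts or pipeline, and no model API is called.

```python
from typing import List, Optional, Tuple

# Hypothetical gaze sample: (time in seconds, x, y) in normalized frame coordinates.
GazePoint = Tuple[float, float, float]
# Hypothetical labeled example for few-shot prompting: (gaze trace, label).
Example = Tuple[List[GazePoint], str]

# Assumed task wording; the study's real instructions are not given in this summary.
TASK = "Decide whether the learner is ATTENTIVE or INATTENTIVE during this video segment."


def describe_gaze(points: List[GazePoint]) -> str:
    """Render a gaze trace as one line of text per sample."""
    return "\n".join(f"t={t:.2f}s gaze=({x:.2f}, {y:.2f})" for t, x, y in points)


def build_prompt(strategy: str, points: List[GazePoint],
                 examples: Optional[List[Example]] = None) -> str:
    """Assemble a prompt for one of the three strategies named in the article."""
    parts = [TASK]
    if strategy == "few-shot":
        # Few-shot: prepend labeled traces before the query trace.
        for ex_points, label in examples or []:
            parts.append("Example gaze trace:\n" + describe_gaze(ex_points))
            parts.append(f"Answer: {label}")
    elif strategy == "chain-of-thought":
        # Chain-of-thought: ask for intermediate reasoning before the label.
        parts.append("Reason step by step about where the gaze falls "
                     "relative to the frame content, then answer.")
    elif strategy != "zero-shot":
        raise ValueError(f"unknown strategy: {strategy}")
    parts.append("Gaze trace:\n" + describe_gaze(points))
    parts.append("Answer:")
    return "\n\n".join(parts)
```

In a real setup the resulting text would accompany the corresponding video frames in a multimodal request. How the authors actually fused gaze with frames is not described here.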

However, none of the VLM-based approaches matched or exceeded classical machine learning classifiers trained on engineered features such as fixation statistics and saccade properties. The negative result highlights a critical gap: while VLMs excel at static image reasoning, they fail to capture the temporal dynamics and subtle patterns of human attention over time. The authors conclude that current VLMs are not yet suitable for real-time educational diagnostics, reinforcing the value of classical models built on engineered gaze features for this task.
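For readers unfamiliar with the engineered features mentioned above, the following Python sketch shows one common way to derive fixation and saccade statistics from raw gaze samples, using a simple velocity threshold in the style of the I-VT algorithm. The feature set, the `gaze_features` name, and the pixel-per-second threshold are illustrative assumptions. The study's actual features and baselines are not detailed in this summary.

```python
import math
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical raw sample: (time in seconds, x in pixels, y in pixels).
Sample = Tuple[float, float, float]


def gaze_features(samples: List[Sample],
                  velocity_threshold: float = 100.0) -> Dict[str, float]:
    """Split a gaze trace into fixations and saccades by point-to-point speed.

    Intervals slower than `velocity_threshold` (px/s, an assumed value) count
    as fixation; faster intervals count as saccade. Saccade amplitude is
    approximated by the path length summed over consecutive fast intervals.
    """
    fixation_durations: List[float] = []
    saccade_amplitudes: List[float] = []
    current_fixation = 0.0  # accumulated duration of the ongoing fixation
    current_saccade = 0.0   # accumulated path length of the ongoing saccade

    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        dist = math.hypot(x1 - x0, y1 - y0)
        if dist / dt < velocity_threshold:
            if current_saccade > 0:
                saccade_amplitudes.append(current_saccade)
                current_saccade = 0.0
            current_fixation += dt
        else:
            if current_fixation > 0:
                fixation_durations.append(current_fixation)
                current_fixation = 0.0
            current_saccade += dist

    # Flush whichever event was still open at the end of the trace.
    if current_fixation > 0:
        fixation_durations.append(current_fixation)
    if current_saccade > 0:
        saccade_amplitudes.append(current_saccade)

    return {
        "fixation_count": float(len(fixation_durations)),
        "mean_fixation_duration": mean(fixation_durations) if fixation_durations else 0.0,
        "saccade_count": float(len(saccade_amplitudes)),
        "mean_saccade_amplitude": mean(saccade_amplitudes) if saccade_amplitudes else 0.0,
    }
```

A feature dictionary like this, computed per video segment, is the kind of input a classical classifier would be trained on. Production pipelines typically work in degrees of visual angle rather than pixels and add noise filtering, which this sketch omits.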

Key Points

Researchers tested Gemini 3 with 70 learners' eye-tracking data from educational videos
No prompting strategy (zero-shot, few-shot, chain-of-thought) beat classical ML baselines
VLMs struggle with temporal attention patterns despite strong semantic reasoning

Why It Matters

The result shows that even advanced VLMs cannot yet replace engineered gaze features for real-time attention monitoring in education.
