UIGaze: How Closely Can VLMs Approximate Human Visual Attention on User Interfaces?
9 VLMs tested on 1,980 screenshots reveal moderate alignment with human attention
A new study from Min Song and colleagues introduces UIGaze, a systematic investigation into how well Vision Language Models (VLMs) can approximate human visual attention on user interfaces. Using the UEyes dataset, which includes 1,980 UI screenshots across four categories (webpage, desktop, mobile, poster) with eye-tracking data from 62 participants, the researchers evaluated nine state-of-the-art VLMs. These models were tasked with zero-shot coordinate prediction—generating gaze point coordinates without prior training on the task. The coordinates were then converted into saliency maps via Gaussian blurring and compared against ground truth using metrics like correlation coefficient (CC), similarity (SIM), and Kullback-Leibler (KL) divergence. The experiments, spanning 1,980 images × 9 models × 3 runs × 3 durations, revealed that VLMs achieve moderate alignment with human gaze patterns, with the degree of alignment varying significantly across UI types and improving with longer viewing durations.
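The saliency-map comparison described above can be sketched as follows. The blur sigma, map resolution, and exact metric formulations here are assumptions for illustration (the paper's released code defines the actual parameters); the metric definitions follow the common saliency-benchmark conventions for CC, SIM, and KL divergence.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def points_to_saliency(points, width, height, sigma=15.0):
    """Rasterize predicted (x, y) gaze points into a Gaussian-blurred
    saliency map, normalized to sum to 1. sigma is an assumed value."""
    m = np.zeros((height, width), dtype=np.float64)
    for x, y in points:
        xi, yi = int(x), int(y)
        if 0 <= yi < height and 0 <= xi < width:
            m[yi, xi] += 1.0
    m = gaussian_filter(m, sigma=sigma)
    total = m.sum()
    return m / total if total > 0 else m

def cc(p, q):
    """Correlation coefficient: Pearson's r between the two maps."""
    return np.corrcoef(p.ravel(), q.ravel())[0, 1]

def sim(p, q):
    """Similarity: histogram intersection of the normalized maps."""
    p, q = p / p.sum(), q / q.sum()
    return np.minimum(p, q).sum()

def kl_div(p, q, eps=1e-12):
    """KL divergence of the ground-truth map p from the prediction q
    (lower is better; eps guards against log(0))."""
    p, q = p / p.sum(), q / q.sum()
    return np.sum(p * np.log(eps + p / (q + eps)))
```

Identical point sets yield CC and SIM near 1 and KL near 0, so the metrics reward predictions whose blurred distribution overlaps the human gaze distribution rather than exact coordinate matches.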
The findings suggest that VLMs are better at capturing exploratory gaze patterns, where users scan broadly across an interface, than at predicting initial fixations, which are more focused and task-specific. This insight has practical implications for UI design: VLMs could augment or partially replace costly eye-tracking studies for understanding user attention on interfaces such as websites or mobile apps. However, the moderate alignment indicates that current models are not yet reliable substitutes for human data, especially for precision-sensitive tasks. The researchers have made all code, predictions, and evaluation results publicly available, fostering further research at this intersection of computer vision and human-computer interaction.
- Evaluated 9 VLMs on 1,980 UI screenshots from 62 participants across 4 UI types
- Models achieved moderate alignment, varying by UI type and improving with longer viewing durations
- VLMs capture exploratory gaze patterns rather than initial fixations, limiting precision
Why It Matters
VLMs could reduce reliance on costly eye-tracking for UI design, but current accuracy limits practical use.