Research & Papers

Deployment-Oriented Session-wise Meta-Calibration for Landmark-Based Webcam Gaze Tracking

A new lightweight AI model achieves 5.79° angular error for browser-based gaze tracking using only facial landmarks.

Deep Dive

Researcher Chenkai Zhang has introduced EMC-Gaze, a novel AI model for webcam-based gaze tracking that prioritizes practical deployment over raw accuracy. The system takes a lightweight, landmark-only approach, pairing an E(3)-equivariant graph encoder with session-wise adaptation and requiring only facial landmarks rather than a heavy image backbone. After a brief 9-point calibration, it achieves 5.79° ± 1.81° RMSE at a 100 cm viewing distance, outperforming an Elastic Net baseline (6.68° ± 2.34°). The model maintains this advantage across subjects and head positions, with particularly strong performance on still-head queries (2.92° vs. 4.45° error).
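The reported 5.79° figure is an angular RMSE. As a reference for what that metric measures, here is a minimal NumPy sketch of angular RMSE between predicted and ground-truth 3D gaze directions (the function names are illustrative, not from the paper):

```python
import numpy as np

def angular_error_deg(pred, true):
    """Per-sample angle, in degrees, between predicted and ground-truth
    3D gaze direction vectors (one vector per row)."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    true = true / np.linalg.norm(true, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * true, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def rmse_deg(pred, true):
    """Root-mean-square angular error in degrees over all samples."""
    return float(np.sqrt(np.mean(angular_error_deg(pred, true) ** 2)))
```

For example, a predicted direction orthogonal to the true one yields 90°, and identical directions yield 0°.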

EMC-Gaze is designed for real-world use with minimal computational footprint. The exported encoder contains just 944,423 parameters, occupies 4.76 MB in ONNX format, and processes samples in 12.58 ms on average in Chromium 145 with ONNX Runtime Web. This makes it suitable for browser-based applications without specialized hardware. The system uses meta-training to learn how to quickly adapt to new users from minimal calibration data, tying Elastic Net at 1-shot calibration and outperforming it from 3-shot onward on the MPIIFaceGaze dataset.
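The k-shot calibration setup can be illustrated with a minimal sketch: fit a lightweight per-session linear readout on frozen encoder embeddings using only the calibration shots. Note this is a generic ridge-regression stand-in, closer in spirit to the Elastic Net baseline than to EMC-Gaze's meta-learned adaptation, and all names here are hypothetical:

```python
import numpy as np

def fit_session_head(emb, targets, lam=1e-2):
    """Closed-form ridge regression mapping frozen encoder embeddings
    (k, d) to 2D gaze targets (k, 2) from k calibration shots."""
    d = emb.shape[1]
    A = emb.T @ emb + lam * np.eye(d)   # regularized Gram matrix
    return np.linalg.solve(A, emb.T @ targets)  # weights, shape (d, 2)

def predict_gaze(emb, W):
    """Apply the per-session readout to new embeddings."""
    return emb @ W
```

With 9 calibration points and a modest embedding dimension, this closed-form fit runs in microseconds, which is consistent with the paper's emphasis on a low per-session calibration burden.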

The research represents a shift toward deployment-friendly gaze tracking that balances accuracy with practical constraints like calibration burden, robustness to head motion, and runtime efficiency. While not claiming state-of-the-art against heavier appearance-based systems, EMC-Gaze establishes a new operating point for applications where lightweight implementation matters more than absolute precision. The code and model are available through standard academic channels, potentially enabling new web-based eye-tracking applications.

Key Points
  • Achieves 5.79° RMSE after 9-point calibration, beating Elastic Net by roughly 1°
  • Lightweight at 944,423 parameters (4.76 MB) with 12.58 ms browser inference time
  • Uses meta-training for quick adaptation to new users from minimal calibration data

Why It Matters

Enables practical webcam eye-tracking for accessibility, UX research, and gaming without specialized hardware.