Image & Video

GLANCE: Gaze-Led Attention Network for Compressed Edge-inference

New system runs gaze-guided object detection in under 10ms, using just 2.2 KiB of memory per frame for gaze estimation.

Deep Dive

A research team from UT Austin and other institutions has introduced GLANCE (Gaze-Led Attention Network for Compressed Edge-inference), a novel AI architecture designed to overcome the severe computational bottlenecks in AR/VR headsets. Inspired by biological foveal vision, the system uses a two-stage pipeline: first, an ultra-efficient, differentiable weightless neural network estimates user gaze direction with just 393 multiply-accumulate operations (MACs) and 2.2 KiB of memory per frame, achieving an angular error of 8.32 degrees. This gaze prediction then guides a second-stage object detection model to focus only on the attended region of interest (ROI), rather than processing the entire visual field uniformly.
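Below is a minimal Python sketch of that two-stage flow. All names (estimate_gaze, gaze_to_roi, detect_in_roi) and the crop geometry (90° field of view, 640×480 scene frame, 192×192 ROI) are illustrative assumptions rather than details from the paper; both stages are stubbed so the control flow runs end to end.

```python
import numpy as np

FOV_DEG = (90.0, 90.0)     # assumed horizontal/vertical field of view
FRAME_SHAPE = (480, 640)   # assumed scene-camera resolution (H, W)
ROI_SIZE = (192, 192)      # assumed attended-region crop size (H, W)

def estimate_gaze(eye_frame: np.ndarray) -> tuple[float, float]:
    """Stage 1 stand-in: predict gaze (yaw, pitch) in degrees.
    GLANCE does this with a tiny weightless neural network
    (~393 MACs, 2.2 KiB per frame); here it is stubbed."""
    return 5.0, -3.0

def gaze_to_roi(yaw: float, pitch: float) -> tuple[int, int, int, int]:
    """Map a gaze angle to a pixel crop window via a simple linear
    projection (the paper's exact mapping may differ)."""
    h, w = FRAME_SHAPE
    rh, rw = ROI_SIZE
    cx = (yaw / FOV_DEG[0] + 0.5) * w     # positive yaw -> look right
    cy = (-pitch / FOV_DEG[1] + 0.5) * h  # positive pitch -> look up
    x0 = int(np.clip(cx - rw / 2, 0, w - rw))
    y0 = int(np.clip(cy - rh / 2, 0, h - rh))
    return y0, x0, rh, rw

def detect_in_roi(scene_frame: np.ndarray, roi: tuple) -> list:
    """Stage 2 stand-in: run the detector only on the attended crop
    instead of the full frame."""
    y0, x0, rh, rw = roi
    crop = scene_frame[y0:y0 + rh, x0:x0 + rw]
    return [("object", 0.90, (x0, y0, x0 + rw, y0 + rh))]  # dummy output

eye = np.zeros((32, 32), dtype=np.uint8)        # dummy eye-camera frame
scene = np.zeros(FRAME_SHAPE, dtype=np.uint8)   # dummy scene frame
print(detect_in_roi(scene, gaze_to_roi(*estimate_gaze(eye))))
```

The efficiency argument falls out of the geometry: the detector touches only the crop the gaze stage selects, so most of the frame is never processed.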

This selective processing yields massive efficiency gains. The GLANCE system reduces computational burden by 40-50% and slashes energy consumption by 65% compared to running full-frame detection. Remarkably, the team successfully deployed the entire pipeline on a microcontroller—the Arduino Nano 33 BLE—where it maintains a critical sub-10ms latency for real-time interaction. On the COCO benchmark, it achieves a mean Average Precision (mAP) of 48.1%, and accuracy jumps to 51.8% for objects within the attended ROI. This ROI-based method significantly outperforms a global YOLOv12n baseline, especially for small objects (51.3% vs. 39.2%).

The work demonstrates a paradigm shift from compute-centric to memory-centric AI for edge devices. By replacing traditional arithmetic-heavy operations with memory lookups for gaze tracking and explicitly modeling human attention, GLANCE delivers better accuracy and far greater efficiency than full-frame processing on resource-constrained wearable platforms. It proves that high-quality computer vision is feasible on microwatt-scale hardware, paving the way for longer battery life and more responsive experiences in next-generation AR/VR and wearable tech.
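To make the "memory lookups instead of arithmetic" idea concrete, here is a toy WiSARD-style weightless network. This is the classic one-shot-trained variant for illustration only; GLANCE's gaze network is a differentiable weightless model, and none of the names or sizes below come from the paper.

```python
import numpy as np

class RAMDiscriminator:
    """Toy weightless 'neuron' bank (WiSARD-style): each node is a small
    lookup table addressed by a tuple of input bits, so inference is
    table reads rather than weight multiplications."""

    def __init__(self, n_inputs: int, tuple_size: int, seed: int = 0):
        assert n_inputs % tuple_size == 0
        rng = np.random.default_rng(seed)
        self.mapping = rng.permutation(n_inputs)   # fixed random wiring
        self.tuple_size = tuple_size
        n_nodes = n_inputs // tuple_size
        self.tables = np.zeros((n_nodes, 2 ** tuple_size), dtype=np.uint8)

    def _addresses(self, bits: np.ndarray) -> np.ndarray:
        # Group the wired bits into tuples; read each tuple as a binary address.
        grouped = bits[self.mapping].reshape(-1, self.tuple_size)
        return grouped @ (1 << np.arange(self.tuple_size))

    def train(self, bits: np.ndarray) -> None:
        # One-shot learning: mark the addressed cell in every node's table.
        self.tables[np.arange(len(self.tables)), self._addresses(bits)] = 1

    def score(self, bits: np.ndarray) -> int:
        # Inference: one memory read per node, then a sum of set bits.
        return int(self.tables[np.arange(len(self.tables)),
                               self._addresses(bits)].sum())

rng = np.random.default_rng(1)
pattern = (rng.random(64) > 0.5).astype(np.uint8)
d = RAMDiscriminator(n_inputs=64, tuple_size=8)
d.train(pattern)
print(d.score(pattern))      # 8: all 8 nodes recognize the trained input
print(d.score(1 - pattern))  # 0: the inverted input hits no trained cells
```

The memory-centric tradeoff is visible here: model capacity lives in lookup tables rather than weights, and a forward pass touches one table cell per node.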

Key Points
  • Uses a weightless neural network for gaze estimation with only 393 MACs and 2.2 KiB of memory per frame, helping cut system energy use by 65%.
  • Deploys the full object detection pipeline on an Arduino Nano 33 BLE, achieving 48.1% mAP on COCO with sub-10ms latency.
  • ROI-focused detection boosts accuracy for small objects to 51.3%, beating the 39.2% of a uniform full-frame baseline.

Why It Matters

Enables complex, real-time AI vision on microcontrollers, making advanced AR/VR applications practical with dramatically longer battery life.