Audio & Speech

Tracking Listener Attention: Gaze-Guided Audio-Visual Speech Enhancement Framework

New AI system boosts speech clarity by up to 23.7% in noisy rooms simply by tracking where you're looking.

Deep Dive

A research team from Academia Sinica has developed a novel solution to the classic 'cocktail party problem' with their Gaze-Guided Audio-Visual Speech Enhancement (GG-AVSE) framework. The system addresses a fundamental limitation in conventional audio-visual speech enhancement: determining which speaker a listener actually wants to hear in multi-talker environments. GG-AVSE solves this by using gaze direction as a supervisory cue, essentially letting users 'select' their target speaker simply by looking at them.
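The paper does not publish its selection logic here, but the core idea can be sketched in a few lines: map the gaze point onto the video frame and pick the detected face closest to it. The names below (Face, select_attended_face, gaze_xy) are hypothetical illustrations, not the authors' actual interface.

```python
# Hypothetical sketch: pick the attended speaker by matching the gaze
# point to the nearest detected face. In GG-AVSE the faces would come
# from a detector such as YOLO5Face; here they are hard-coded.
from dataclasses import dataclass

@dataclass
class Face:
    x1: float; y1: float; x2: float; y2: float  # bounding box in pixels

def select_attended_face(gaze_xy: tuple[float, float], faces: list[Face]) -> Face:
    """Return the face whose box center is closest to the gaze point."""
    gx, gy = gaze_xy
    def dist2(f: Face) -> float:
        cx, cy = (f.x1 + f.x2) / 2, (f.y1 + f.y2) / 2
        return (cx - gx) ** 2 + (cy - gy) ** 2
    return min(faces, key=dist2)

# Example: two detected speakers; the gaze falls near the right-hand face.
faces = [Face(100, 80, 220, 240), Face(400, 90, 520, 250)]
print(select_attended_face((450.0, 160.0), faces))  # -> the second face
```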

The technical implementation combines gaze signals with a YOLO5Face detector to extract facial features of the attended speaker, then integrates this information with the pretrained AVSEMamba model through two strategies: zero-shot merging and partial visual fine-tuning. For evaluation, the team created the AVSEC2-Gaze dataset specifically for this research. On this benchmark, GG-AVSE shows substantial improvements over gaze-free baselines: a 10.08% boost in PESQ (Perceptual Evaluation of Speech Quality), a 5.18% improvement in STOI (Short-Time Objective Intelligibility), and a 23.69% gain in SI-SDR (Scale-Invariant Signal-to-Distortion Ratio).
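For readers unfamiliar with the metric behind the headline number, SI-SDR measures how much of an enhanced signal is the clean reference (up to scale) versus residual distortion. Below is a minimal NumPy sketch of the standard textbook definition, not the authors' evaluation code.

```python
# Standard SI-SDR: project the estimate onto the reference to get a
# scaled target, then compare target energy to residual-noise energy.
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """SI-SDR in dB between a clean reference and an enhanced estimate."""
    reference = reference - reference.mean()  # zero-mean for scale invariance
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))

# Example: a clean 440 Hz tone vs. the same tone with additive noise.
t = np.linspace(0, 1, 16000)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * np.random.default_rng(0).standard_normal(t.size)
print(f"SI-SDR: {si_sdr(clean, noisy):.2f} dB")
```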

What makes this approach particularly promising is its practicality for real-world applications. Unlike systems that require pre-enrolled speaker profiles or complex calibration, GG-AVSE leverages a natural human behavior, gaze direction, as its primary control mechanism. The framework demonstrates how combining multiple sensory inputs (audio, visual, and now gaze) can create more robust and intuitive AI systems. The paper has been accepted for presentation at IEEE ICASSP 2026, signaling its relevance to the audio signal processing community.

Key Points
  • Uses gaze direction as supervisory cue to identify target speaker in multi-talker environments
  • Achieves 23.69% improvement in SI-SDR and 10.08% better PESQ scores over baselines
  • Combines YOLO5Face detection with the AVSEMamba model through zero-shot merging and partial visual fine-tuning (sketched after this list)
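As a rough illustration of what "partial visual fine-tuning" typically means, the PyTorch sketch below freezes a pretrained backbone and updates only the visual branch. The module names (ToyAVSE, visual_frontend) are stand-ins; AVSEMamba's actual layout may differ.

```python
# Hedged sketch of partial visual fine-tuning: freeze the pretrained
# enhancement backbone, leave only the visual branch trainable.
import torch
import torch.nn as nn

class ToyAVSE(nn.Module):
    """Stand-in for a pretrained audio-visual enhancement model."""
    def __init__(self):
        super().__init__()
        self.visual_frontend = nn.Linear(512, 256)  # visual branch (trained)
        self.audio_encoder = nn.Linear(257, 256)    # frozen
        self.decoder = nn.Linear(256, 257)          # frozen

def partial_visual_finetune(model: nn.Module, visual_prefix: str = "visual_frontend"):
    """Freeze all parameters except those under the visual branch."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(visual_prefix)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-4)

model = ToyAVSE()
optimizer = partial_visual_finetune(model)
# Only the visual frontend's weights and biases remain trainable.
print(sum(p.numel() for p in model.parameters() if p.requires_grad))
```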

Why It Matters

Could enable smarter hearing aids, AR glasses, and meeting systems that automatically focus on who you're looking at.