Audio & Speech

Selective Attention System (SAS): Device-Addressed Speech Detection for Real-Time On-Device Voice AI

arXiv eess.AS April 10, 2026

⚡ New AI model uses conversation history to decide when to listen, cutting false triggers by 40%.

Deep Dive

A team of researchers has published a paper on arXiv introducing the Selective Attention System (SAS), a novel approach to a critical problem for on-device voice AI: knowing when a user is actually talking to the device. Current systems often misfire in noisy, multi-speaker environments because they analyze each utterance in isolation. The researchers' key insight is that this should be modeled as a Sequential Device-Addressed Routing (SDAR) problem, where the system uses the short-term history of an interaction—like who spoke last—to make a more informed decision about whether to activate.
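To make the sequential framing concrete, here is a minimal sketch of a router that adjusts a per-utterance score using only past turns. It is an illustration of the general idea, not the paper's model: the class names, speaker labels, threshold, and adjustment value are all hypothetical, and the acoustic score is assumed to come from some upstream per-utterance classifier.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str            # hypothetical labels: "user", "other", "device"
    addressed_device: bool  # routing decision recorded for that turn


class HistoryRouter:
    """Toy causal router: combines a per-utterance score with past turns only."""

    def __init__(self, history_len=3, threshold=0.5, adjust=0.3):
        # All three values are illustrative assumptions, not figures from the paper.
        self.history = deque(maxlen=history_len)
        self.threshold = threshold
        self.adjust = adjust

    def note_device_reply(self):
        # Record that the device itself just spoke.
        self.history.append(Turn("device", True))

    def decide(self, speaker, acoustic_score):
        score = acoustic_score
        if self.history:
            last = self.history[-1]
            if last.speaker == "device":
                # Speech right after a device reply is likely a follow-up to it.
                score += self.adjust
            elif last.speaker != speaker and not last.addressed_device:
                # A different person answering non-device speech suggests side talk.
                score -= self.adjust
        addressed = score >= self.threshold
        self.history.append(Turn(speaker, addressed))
        return addressed


router = HistoryRouter()
first = router.decide("user", 0.4)       # no history, weak score
side_talk = router.decide("other", 0.6)  # second person replies to the first
router.note_device_reply()
follow_up = router.decide("user", 0.4)   # same weak score, but after a device reply
print(first, side_talk, follow_up)       # prints: False False True
```

The same acoustic score of 0.4 is rejected in isolation but accepted after a device reply, which is the kind of behavior a per-utterance classifier cannot produce.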

On a proprietary 60-hour multi-speaker English test set, the audio-only version of SAS achieved an F1 score of 0.86. When fused with optional camera input for visual cues, the F1 score rose to 0.95 (97% precision, 93% recall). An ablation underlined the importance of context: removing the causal interaction history component dropped the F1 score from 0.95 to 0.57. This suggests that conversational context carries much of the signal, beyond the acoustic properties of a single phrase.
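As a quick consistency check on the reported figures, the F1 score is the harmonic mean of precision and recall. The short sketch below uses the standard formula with the precision and recall quoted above; the function name is our own.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the audio + camera configuration.
f1 = f1_score(0.97, 0.93)
print(round(f1, 2))  # prints: 0.95
```

The result matches the reported F1 of 0.95 after rounding.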

Designed for the edge, SAS is built to run entirely on-device on common ARM Cortex-A class processors. It operates with a strict latency budget of under 150 milliseconds and a memory footprint of less than 20 MB, meeting the stringent requirements for always-on, battery-powered devices like smart speakers and phones. The model represents a shift from simple keyword spotting to a more conversational understanding of intent, which is essential for reliable voice interfaces in real-world settings.

Key Points

Achieves an F1 score of 0.95 (97% precision, 93% recall) when audio is fused with optional camera input; the audio-only version reaches 0.86.
Runs fully on-device with <150 ms latency and <20 MB memory footprint, suitable for ARM-based edge hardware.
Treats activation as a 'sequential routing' problem, using interaction history to reduce false triggers by 40% compared to classifying each utterance in isolation.

Why It Matters

Enables more reliable, private voice assistants that work seamlessly in noisy homes and offices without cloud dependency.
