Selective Attention System (SAS): Device-Addressed Speech Detection for Real-Time On-Device Voice AI
New AI model uses conversation history to decide when to listen, cutting false triggers by 40%.
A team of researchers has published a paper on arXiv introducing the Selective Attention System (SAS), a novel approach to a critical problem for on-device voice AI: knowing when a user is actually talking to the device. Current systems often misfire in noisy, multi-speaker environments because they analyze each utterance in isolation. The researchers' key insight is that activation should be modeled as a Sequential Device-Addressed Routing (SDAR) problem, where the system uses the short-term history of an interaction, such as who spoke last, to make a more informed decision about whether to activate.
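To make the routing idea concrete, here is a minimal sketch of a decision that depends on recent turns rather than on one utterance alone. It is not the paper's model: the `SequentialRouter` class, the `Turn` record, the `"device"` speaker label, the 0.5 threshold, and the 0.2 context shift are all hypothetical values chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str      # hypothetical label: a person's ID or "device"
    addressed: bool   # was this turn part of the human-device exchange?


class SequentialRouter:
    """Toy router (illustrative only): shifts a per-utterance score using recent turns."""

    def __init__(self, history_len=4, threshold=0.5, context_shift=0.2):
        self.history = deque(maxlen=history_len)  # short, causal history only
        self.threshold = threshold                # assumed activation threshold
        self.context_shift = context_shift        # assumed size of the context effect

    def observe(self, speaker, addressed):
        """Log a turn that was not routed here, e.g. the device's own reply."""
        self.history.append(Turn(speaker, addressed))

    def decide(self, speaker, acoustic_score):
        """Return True if the utterance should activate the device."""
        score = acoustic_score
        if self.history:
            last = self.history[-1]
            if last.speaker == "device" or (last.addressed and last.speaker == speaker):
                score += self.context_shift   # likely a follow-up to the device
            elif not last.addressed:
                score -= self.context_shift   # likely ongoing side conversation
        addressed = score >= self.threshold
        self.history.append(Turn(speaker, addressed))
        return addressed


# The same weak acoustic score (0.4) judged without and with context.
isolated = SequentialRouter().decide("alice", 0.4)

router = SequentialRouter()
router.decide("alice", 0.7)        # clear command, no history yet
router.observe("device", True)     # the device answers
follow_up = router.decide("alice", 0.4)

side_router = SequentialRouter()
side_router.observe("bob", False)  # two people talking to each other
side_talk = side_router.decide("carol", 0.6)

print(isolated, follow_up, side_talk)  # False True False
```

In this toy version, a quiet follow-up right after the device has replied is accepted, while a louder remark in the middle of a side conversation is rejected. A per-utterance classifier sees only the score and cannot make that distinction.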
On a proprietary 60-hour multi-speaker English test set, the audio-only version of SAS achieved an F1 score of 0.86. When fused with optional camera input for visual cues, performance jumped to an F1 of 0.95 (97% precision, 93% recall). The importance of context was starkly demonstrated: removing the causal interaction history component caused performance to plummet from 0.95 to 0.57. This shows that conversational context is a major signal, not just the acoustic properties of a single phrase.
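The reported figures can be checked with a few lines of arithmetic. The sketch below uses the standard F1 definition (the harmonic mean of precision and recall) and the numbers quoted above; the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Fused audio + camera configuration: 97% precision, 93% recall.
fused_f1 = f1_score(0.97, 0.93)
print(round(fused_f1, 2))  # 0.95

# Relative F1 drop when interaction history is removed (0.95 -> 0.57).
relative_drop = (0.95 - 0.57) / 0.95
print(round(relative_drop, 2))  # 0.4
```

The precision and recall values are consistent with the stated F1 of 0.95, and removing interaction history corresponds to a 40% relative drop in F1.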
Designed for the edge, SAS is built to run entirely on-device on common ARM Cortex-A class processors. It operates with a strict latency budget of under 150 milliseconds and a memory footprint of less than 20 MB, meeting the stringent requirements for always-on, battery-powered devices like smart speakers and phones. The model represents a shift from simple keyword spotting to a more conversational understanding of intent, which is essential for reliable voice interfaces in real-world settings.
- Achieves an F1 score of 0.95 (97% precision, 93% recall) when audio is fused with optional camera input, up from 0.86 for audio alone.
- Runs fully on-device with <150 ms latency and <20 MB memory footprint, suitable for ARM-based edge hardware.
- Treats activation as a 'sequential routing' problem, using interaction history to reduce false triggers by 40% compared to classifying each utterance in isolation.
Why It Matters
Enables more reliable, private voice assistants that work seamlessly in noisy homes and offices without cloud dependency.