DMA-KWS framework achieves 97.85% AUC for custom wake words
New system adapts by updating only 187k parameters and is built to stay consistent across speakers.
A new paper from Zhiqi Ai and colleagues at Shanghai University presents DMA-KWS, a robust framework for user-defined keyword spotting. The system tackles three persistent challenges: confusion between similar-sounding words, inconsistent performance across speakers, and the large amount of data usually needed for reliable wake-word detection. DMA-KWS employs a dual-stage matching pipeline. First, CTC (Connectionist Temporal Classification) decoding with a streaming phoneme search locates candidate segments. Then a QbyT (Query-by-Text) phoneme matcher performs fine-grained verification on those candidates. This two-step process improves the system's ability to tell confusable keywords apart.
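To make the two stages concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the paper's second stage is a learned matcher, whereas this sketch substitutes a plain edit-distance score. The phoneme ids, the blank id `0`, the function names, and both thresholds are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

BLANK = 0  # assumed id of the CTC blank symbol


def ctc_greedy_decode(frame_ids: Sequence[int]) -> List[Tuple[int, int]]:
    """Collapse repeated frame labels and drop blanks.

    Returns (phoneme_id, index_of_first_frame) pairs.
    """
    decoded: List[Tuple[int, int]] = []
    prev = BLANK
    for t, p in enumerate(frame_ids):
        if p != BLANK and p != prev:
            decoded.append((p, t))
        prev = p
    return decoded


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def find_candidates(
    decoded: Sequence[Tuple[int, int]], keyword: Sequence[int], max_errors: int = 1
) -> List[Tuple[int, List[int]]]:
    """Stage 1 (loose): keep windows within max_errors edits of the keyword."""
    n = len(keyword)
    out = []
    for i in range(len(decoded) - n + 1):
        window = [p for p, _ in decoded[i:i + n]]
        if edit_distance(window, keyword) <= max_errors:
            out.append((decoded[i][1], window))
    return out


def verify(window: Sequence[int], keyword: Sequence[int], threshold: float = 0.9) -> bool:
    """Stage 2 (strict): toy stand-in for a learned phoneme matcher."""
    score = 1.0 - edit_distance(window, keyword) / max(len(keyword), 1)
    return score >= threshold


def detect(frame_ids: Sequence[int], keyword: Sequence[int]) -> List[int]:
    """Run both stages and return the start frames of accepted segments."""
    decoded = ctc_greedy_decode(frame_ids)
    return [start for start, window in find_candidates(decoded, keyword)
            if verify(window, keyword)]


# Hypothetical per-frame argmax phoneme ids; the keyword is phonemes 3-5-7.
frames = [0, 3, 3, 0, 5, 8, 8, 0, 0, 3, 5, 5, 7, 0]
keyword = [3, 5, 7]
candidates = find_candidates(ctc_greedy_decode(frames), keyword)
hits = detect(frames, keyword)
print(candidates)  # stage 1 flags both the near-miss 3-5-8 and the true 3-5-7
print(hits)        # stage 2 keeps only the segment starting at frame 9
```

The loose first pass deliberately lets the confusable segment 3-5-8 through, and the stricter second pass rejects it. That split between a cheap search and a careful check is the general idea behind a dual-stage design.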
For personalization, DMA-KWS uses multi-modal enrollment, which fuses user-specific speech with text embeddings and consistently outperforms text-only enrollment in speaker-dependent settings. The framework also includes a parameter-efficient continual adaptation mechanism that updates only 187k parameters using synthetic and real data, keeping it light enough for on-device use. On the LibriPhrase Hard subset, DMA-KWS achieves state-of-the-art performance with 97.85% AUC (area under the ROC curve) and a 6.13% Equal Error Rate (EER). The paper, accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), runs 14 pages and includes detailed experiments and analysis.
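For readers unfamiliar with the two metrics quoted above, the following sketch shows how AUC and Equal Error Rate are commonly computed from detector scores. The scores and labels are made-up toy data, not results from the paper, and the function names are illustrative. The paper's exact evaluation protocol may differ.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (false-positive rate, true-positive rate) points.

    The threshold is swept from high to low; tied scores are handled together.
    Assumes at least one positive (label 1) and one negative (label 0).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def equal_error_rate(points: Sequence[Tuple[float, float]]) -> float:
    """Error rate at the point where false-positive rate equals miss rate.

    Uses linear interpolation between the two ROC points that bracket it.
    """
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        d0 = x0 - (1.0 - y0)  # FPR minus FNR at the first point
        d1 = x1 - (1.0 - y1)  # FPR minus FNR at the second point
        if d0 < 0.0 <= d1:
            t = -d0 / (d1 - d0)
            return x0 + t * (x1 - x0)
    raise ValueError("ROC curve never crosses the FPR == FNR line")


# Toy data: 1 = keyword present, 0 = keyword absent.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
curve = roc_points(scores, labels)
auc = area_under_curve(curve)
eer = equal_error_rate(curve)
print(auc, eer)  # 0.875 0.25
```

A higher AUC and a lower EER both indicate a detector that separates true keywords from impostors more cleanly.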
- Dual-stage matching (CTC + QbyT phoneme verification) cuts confusion between similar keywords
- Multi-modal enrollment fuses user speech and text, boosting speaker-dependent accuracy
- Lightweight adaptation updates only 187k parameters, enabling on-device deployment
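The general idea behind parameter-efficient adaptation can be sketched as follows: freeze most of the model and let gradient updates touch only a small trainable subset. This summary does not say which parameters DMA-KWS updates or how, so the group names (`backbone`, `adapter`), the sizes, and the plain SGD step below are all hypothetical and far smaller than the 187k figure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ParamGroup:
    values: List[float]
    trainable: bool


def count_trainable(params: Dict[str, ParamGroup]) -> int:
    """Number of scalar parameters that adaptation is allowed to change."""
    return sum(len(g.values) for g in params.values() if g.trainable)


def sgd_step(params: Dict[str, ParamGroup], grads: Dict[str, List[float]], lr: float) -> None:
    """Apply one SGD update in place, skipping frozen groups."""
    for name, group in params.items():
        if group.trainable:
            group.values = [v - lr * g for v, g in zip(group.values, grads[name])]


# Hypothetical model: a large frozen backbone plus a tiny trainable adapter.
model = {
    "backbone": ParamGroup(values=[0.5] * 1000, trainable=False),
    "adapter": ParamGroup(values=[0.0] * 8, trainable=True),
}
# Pretend gradients of 1.0 everywhere, just to show which values move.
grads = {name: [1.0] * len(group.values) for name, group in model.items()}
sgd_step(model, grads, lr=0.1)

print(count_trainable(model))        # 8
print(model["backbone"].values[:3])  # unchanged: [0.5, 0.5, 0.5]
```

Because only the small group is ever written to, the memory and compute needed for adaptation scale with that group rather than with the full model, which is what makes this style of update attractive on constrained devices.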
Why It Matters
Enables personalized, low-cost wake-word recognition for smart assistants and voice interfaces.