Audio & Speech

Semantic-Aware Interruption Detection in Spoken Dialogue Systems: Benchmark, Metric, and Model

New LLM-based model reduces interruption-detection error penalties nearly threefold on real-world dialogue data.

Deep Dive

A research team led by Kangxiang Xia has published a paper tackling a core flaw in modern voice assistants: unnatural turn-taking. Current systems face two bad options: fast but "trigger-happy" voice activity detectors that mistake user backchannels (like "uh-huh") for interruptions, and slower end-to-end models that introduce awkward conversational delays. The team's work, "Semantic-Aware Interruption Detection in Spoken Dialogue Systems," offers a three-part solution to this long-standing problem.

First, they built SID-Bench, the first benchmark dataset for interruption detection compiled entirely from real human dialogues, moving beyond synthetic data. Second, they proposed the Average Penalty Time (APT) metric, a novel way to quantitatively measure the trade-off between false alarms and late responses by assigning a temporal cost to each error. Finally, they designed an LLM-based detection model specifically trained to understand the subtle semantic cues that signal a genuine interruption intent.
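The article does not reproduce the paper's exact APT formula, but the idea of assigning a temporal cost to each error can be sketched as follows. This is an illustrative implementation only: the event structure, penalty values (`t_fa`, `t_miss`), and the function name `average_penalty_time` are assumptions for the sake of the example, not the authors' definition.

```python
from dataclasses import dataclass

@dataclass
class Event:
    is_true_interruption: bool   # ground truth: was this a genuine interruption?
    detected: bool               # did the detector fire on this event?
    latency_s: float = 0.0       # detection delay after the true onset, in seconds

def average_penalty_time(events, t_fa=1.5, t_miss=3.0):
    """Mean temporal cost per event (lower is better).

    Hypothetical scoring: a false alarm (e.g. firing on a backchannel)
    costs a fixed penalty t_fa; a correctly detected interruption costs
    its detection latency; a missed interruption costs a timeout t_miss.
    """
    penalties = []
    for e in events:
        if e.is_true_interruption:
            penalties.append(e.latency_s if e.detected else t_miss)
        else:
            penalties.append(t_fa if e.detected else 0.0)
    return sum(penalties) / len(penalties)

events = [
    Event(True, True, 0.4),    # genuine interruption caught after 0.4 s
    Event(False, True),        # backchannel wrongly flagged -> false alarm
    Event(False, False),       # backchannel correctly ignored -> no penalty
    Event(True, False),        # interruption missed entirely -> timeout
]
print(average_penalty_time(events))  # (0.4 + 1.5 + 0.0 + 3.0) / 4 = 1.225
```

A metric shaped like this makes the responsiveness-robustness trade-off explicit: a trigger-happy detector accumulates `t_fa` penalties, while an overly cautious one accumulates latency and timeout penalties, so a single averaged number captures both failure modes.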

The results are significant. Their optimized model outperforms mainstream baselines, achieving a nearly threefold reduction in the APT metric. This means the system is both more responsive and more accurate, effectively dissolving the tension between speed and stability that has plagued the field. By making their benchmark, code, and model publicly available, the researchers have established a new state-of-the-art and provided the tools needed for the next generation of truly conversational AI agents that can handle natural, overlapping speech.

Key Points
  • Introduced SID-Bench, the first real-world benchmark for semantic-aware interruption detection, built from human dialogues.
  • Proposed the novel Average Penalty Time (APT) metric to rigorously quantify the responsiveness-robustness trade-off in dialogue systems.
  • The team's LLM-based detection model reduces the APT metric by nearly 3x, outperforming current mainstream baseline methods.

Why It Matters

This breakthrough enables voice assistants and customer service bots to handle natural, overlapping conversation without awkward pauses or constant false interruptions.