Audio & Speech

DuplexSLA: Full-duplex AI speaks, plans, and calls tools on a 160ms clock

DuplexSLA lets AI listen, speak, and call tools simultaneously on a 160ms timeline.

Deep Dive

DuplexSLA replaces the turn-based design of traditional spoken dialogue systems with a full-duplex architecture. It uses a dual-stream, three-channel formulation (continuous user audio, discrete assistant audio, and a rate-limited textual action channel), and a single backbone decodes all streams jointly. Because every channel advances on the same 160ms chunk timeline, listening, speaking, planning, and tool calling proceed concurrently, with no external semantic VAD and no need to wait for a turn boundary before invoking a tool.
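To make the shared timeline concrete, here is a minimal sketch of how three channels could be laid out per 160ms chunk. Only the chunk duration and the three channel roles come from the summary above; the 16 kHz sample rate, the `Chunk` layout, and the helper names are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List

CHUNK_MS = 160            # chunk duration stated in the summary
SAMPLE_RATE_HZ = 16_000   # assumption: the source does not give a sample rate

# Number of user audio samples covered by one chunk (2560 at 16 kHz).
SAMPLES_PER_CHUNK = SAMPLE_RATE_HZ * CHUNK_MS // 1000


@dataclass
class Chunk:
    """One 160 ms step holding all three channels (hypothetical layout)."""
    index: int
    user_audio: List[float]                                     # continuous user audio
    assistant_tokens: List[int] = field(default_factory=list)   # discrete assistant audio
    action_text: str = ""                                       # rate-limited action channel


def chunk_index(t_ms: int) -> int:
    """Map a timestamp in milliseconds to its chunk on the shared timeline."""
    return t_ms // CHUNK_MS


def split_into_chunks(samples: List[float]) -> List[Chunk]:
    """Cut a user audio buffer into consecutive chunks, dropping a trailing partial chunk."""
    n = len(samples) // SAMPLES_PER_CHUNK
    return [
        Chunk(index=i,
              user_audio=samples[i * SAMPLES_PER_CHUNK:(i + 1) * SAMPLES_PER_CHUNK])
        for i in range(n)
    ]
```

Under these assumptions, one second of user audio yields six full chunks, and any event (an interruption, a tool call token) can be addressed by its chunk index rather than by a turn boundary.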

The model introduces two key capabilities: (1) semantic-driven turn-taking control, where interruptions, pauses, and backchannels are handled internally by the same neural network, and (2) in-conversation planning and tool calling, where planning text and structured tool calls are emitted on the action channel without pausing assistant audio. To validate these innovations, the authors constructed DuplexSLA-Bench, a benchmark covering pause, interrupt, and backchannel turn-taking alongside three styles of in-conversation tool calling. The project, demos, and evaluation suite are publicly available.
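The idea of emitting a tool call on the action channel without pausing speech can be sketched as a simple interleaving loop. This is a toy illustration, not the model's decoder: the per-chunk limit of 4 action tokens, the `interleave` function, and the `<silence>` placeholder are all hypothetical, since the summary only says the action channel is rate-limited.

```python
from typing import Iterator, List, Tuple

MAX_ACTION_TOKENS_PER_CHUNK = 4  # hypothetical rate limit; the source gives no number


def interleave(
    audio_chunks: List[str],
    action_tokens: List[str],
    limit: int = MAX_ACTION_TOKENS_PER_CHUNK,
) -> Iterator[Tuple[int, str, List[str]]]:
    """Yield (chunk_index, audio, actions) for each step of the timeline.

    Audio is emitted at every step. The action channel receives at most
    `limit` tokens per step, so a long tool call spreads over several chunks
    instead of blocking the audio stream.
    """
    cursor = 0
    step = 0
    while step < len(audio_chunks) or cursor < len(action_tokens):
        # Placeholder once speech has ended (an assumption, not from the source).
        audio = audio_chunks[step] if step < len(audio_chunks) else "<silence>"
        actions = action_tokens[cursor:cursor + limit]
        cursor += len(actions)
        yield step, audio, actions
        step += 1


# Usage: a 10-token tool call rides alongside three chunks of speech.
speech = ["a0", "a1", "a2"]
call = ['{"tool":', '"weather",', '"args":', "{", '"city":', '"Oslo"', "}", "}", "<end>", "<pad>"]
for idx, audio, actions in interleave(speech, call):
    print(idx, audio, actions)
```

The design point is that the audio column is never empty while the action column fills in: the tool call completes over three chunks (480ms under the 160ms clock) while the assistant keeps speaking.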

Key Points
  • DuplexSLA decodes assistant audio and structured actions on a shared 160ms chunk timeline, keeping speech and actions time-aligned.
  • Semantic-driven turn-taking handles interruption, pause, and backchannel internally without an external voice activity detector.
  • In-conversation planning and tool calls are emitted on a textual action channel without halting assistant speech.

Why It Matters

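DuplexSLA brings real-time agentic behavior to voice AI. Because turn-taking, planning, and tool calls all run inside one model on one timeline, a conversational assistant can keep speaking while it plans or calls a tool, rather than waiting for a turn boundary.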