Audio & Speech

UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

One model replaces five speech processing modules for faster, more natural conversations.

Deep Dive

Traditional speech interaction systems rely on a cascade of separate modules—voice activity detection (VAD), turn-taking detection (TD), speaker recognition (SR), automatic speech recognition (ASR), and question answering (QA). This pipeline accumulates latency, loses information, and propagates errors, especially in full-duplex scenarios where both parties speak simultaneously.
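
To make the failure mode concrete, here is a minimal Python sketch of such a cascaded front-end. The five stage functions are trivial placeholders standing in for real models (none of these names come from the paper); the point is the serial structure, where each hand-off adds latency and any stage's mistake propagates to everything after it.

    # Placeholder stages; each stands in for a separate trained model.
    def run_vad(chunk: bytes) -> bool:                # 1. voice activity detection
        return len(chunk) > 0                         # placeholder heuristic

    def turn_ended(chunk: bytes) -> bool:             # 2. turn-taking detection
        return True                                   # placeholder

    def is_target(chunk: bytes, ref: bytes) -> bool:  # 3. speaker recognition
        return True                                   # placeholder

    def transcribe(chunk: bytes) -> str:              # 4. automatic speech recognition
        return "<transcript>"                         # placeholder

    def answer(text: str) -> str:                     # 5. question answering
        return f"reply to {text}"                     # placeholder

    def cascaded_front_end(chunk: bytes, ref: bytes) -> str | None:
        """Serial pipeline: five hand-offs, five chances to drop information."""
        if not run_vad(chunk):
            return None                               # no speech detected
        if not turn_ended(chunk):
            return None                               # cannot respond mid-turn
        if not is_target(chunk, ref):
            return None                               # ignore non-target speakers
        return answer(transcribe(chunk))              # QA sees only ASR's text output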

To address this, a team of researchers introduces UAF (Unified Audio Front-end LLM), the first model that reframes all these front-end tasks as a single autoregressive sequence prediction problem. UAF takes streaming fixed-duration audio chunks (600 ms) as input, uses a reference audio prompt at the start to identify the target speaker, and generates discrete tokens that encode both semantic content and system-level states like interruption signals.
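
As a rough illustration of that interface, the sketch below shows one possible streaming loop. The control tokens (<interrupt>, <silence>) and the stub encode/generate functions are assumptions made for illustration; only the 600 ms chunking, the reference-audio prompt, and the mixed semantic/state token stream are described in the paper.

    CHUNK_MS = 600  # fixed chunk duration from the paper

    def encode_audio(chunk: bytes) -> list[str]:
        return [f"<audio:{len(chunk)}>"]         # stand-in audio tokenizer

    def generate(context: list[str]) -> list[str]:
        return ["hello"]                         # stand-in for one LLM decode step

    def uaf_stream(reference_audio: bytes, chunks: list[bytes]):
        """One autoregressive model covers VAD/TD/SR/ASR/QA by emitting a
        mixed stream of semantic tokens and system-state tokens."""
        context = encode_audio(reference_audio)  # anchor the target speaker once
        for chunk in chunks:                     # streaming 600 ms chunks
            context += encode_audio(chunk)
            for token in generate(context):
                context.append(token)            # autoregressive feedback
                if token == "<interrupt>":
                    yield ("state", "interrupt") # barge-in: halt the assistant's speech
                elif token == "<silence>":
                    break                        # no target-speaker speech this chunk
                else:
                    yield ("text", token)        # semantic content (transcript/answer)

Because state signals and transcript text arrive in a single token stream, a consumer of this loop can react to an interruption immediately, without waiting for a separate detector module to fire.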

Experimental results show that UAF achieves leading performance across all evaluated tasks—VAD, TD, SR, ASR, and QA—while dramatically reducing response latency and improving interruption accuracy in real-world interaction settings. By unifying the front-end, UAF eliminates the need for multiple separate components, enabling truly seamless, full-duplex speech interaction. The work highlights that optimizing the audio front-end is just as critical as advancing the back-end model for conversational AI systems like GPT-4o.

Key Points
  • UAF replaces five separate front-end components (VAD, TD, SR, ASR, QA) with one unified autoregressive LLM.
  • Processes 600 ms audio chunks in real time, using a reference audio prompt to anchor the target speaker.
  • Outperforms cascaded systems in both response latency and interruption accuracy during full-duplex interactions.

Why It Matters

Voice assistants that can listen and speak simultaneously, without added lag or compounding errors, will feel far more human.