Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency
New benchmark tests six major voice AI systems on real human speech with disfluencies and multi-step tool use.
A research team led by Guan-Ting Lin has released Full-Duplex-Bench-v3 (FDB-v3), a benchmark designed to evaluate how well modern voice AI agents handle real-world conversational challenges. Unlike earlier benchmarks that rely on synthetic speech, FDB-v3 uses entirely real human audio, annotated for five categories of speech disfluencies (such as "ums," stutters, and self-corrections) and paired with scenarios requiring chained API calls across four task domains. The result is a far more realistic testbed for systems that must understand natural speech and execute multi-step actions, such as booking a flight while handling user interruptions.
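The paper's exact data format isn't reproduced here, but a scenario that pairs a disfluent utterance with an expected chain of tool calls might look like the following sketch. All field and function names are illustrative assumptions, not the actual FDB-v3 schema:

```python
from dataclasses import dataclass

# Hypothetical record types; names are illustrative, not from FDB-v3.
@dataclass
class ToolCall:
    name: str    # e.g. "search_flights"
    args: dict   # arguments the agent must supply

@dataclass
class Scenario:
    audio_path: str        # real human audio containing disfluencies
    disfluency_tags: list  # e.g. ["filler", "self-correction"]
    expected_calls: list   # ordered chain of ToolCall steps

# A flight-booking scenario: the user self-corrects mid-utterance,
# and the agent must still produce the right two-step call chain.
booking = Scenario(
    audio_path="audio/flight_042.wav",
    disfluency_tags=["filler", "self-correction"],
    expected_calls=[
        ToolCall("search_flights", {"destination": "Boston"}),
        ToolCall("book_flight", {"flight_id": "<from step 1>"}),
    ],
)
```

Chaining is what makes the tasks hard: a correct second call depends on output from the first, so a transcription slip early on can derail the whole sequence.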
The benchmark evaluated six major voice agent configurations: GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional cascaded pipeline (Whisper→GPT-4o→TTS). The results revealed a clear trade-off between speed and conversational quality. GPT-Realtime led in accuracy (0.600 Pass@1) and was best at avoiding inappropriate interruptions, with a 13.5% rate. Meanwhile, Gemini Live 3.1 achieved the fastest end-to-end latency at 4.25 seconds but suffered the lowest turn-taking success rate at 78.0%. The traditional cascaded system had perfect turn-taking but the highest latency at 10.12 seconds.
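Pass@1 here means the fraction of scenarios an agent completes correctly on its first attempt, and the interruption rate is the share of agent turns that cut in on the user inappropriately. A minimal scorer over per-scenario logs could look like the sketch below; the log shape is an assumption for illustration, not the benchmark's actual format:

```python
def pass_at_1(results):
    """Fraction of scenarios solved on the first attempt.

    `results` is a list of booleans: True if the agent's single
    attempt at that scenario succeeded.
    """
    return sum(results) / len(results) if results else 0.0

def interruption_rate(turns):
    """Share of agent turns flagged as inappropriate interruptions.

    `turns` is a list of dicts with a boolean "interrupted_user"
    key (an assumed log field, for illustration only).
    """
    if not turns:
        return 0.0
    return sum(t["interrupted_user"] for t in turns) / len(turns)

# Toy data only, not the benchmark's results: 3 of 5 scenarios pass.
print(pass_at_1([True, True, True, False, False]))  # 0.6
```

Reporting both numbers side by side is what surfaces the speed-versus-quality trade-off: a fast system can still score poorly if it talks over the user or fumbles the call chain.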
Across all tested systems, the study identified two persistent failure modes: handling user self-corrections (e.g., "I want a flight to—actually, to Boston") and performing multi-step reasoning in difficult scenarios. This indicates that while raw speed and single-turn accuracy are improving, building voice agents that can gracefully manage the full complexity of human dialogue remains a significant challenge. The benchmark provides developers with concrete metrics to track progress toward more natural and capable conversational AI.
- GPT-Realtime achieved highest accuracy (0.600 Pass@1) and best interruption avoidance (13.5%)
- Gemini Live 3.1 delivered fastest latency (4.25s) but lowest turn-taking success (78.0%)
- All systems struggled with user self-corrections and multi-step reasoning in hard scenarios
Why It Matters
Provides concrete metrics for developers building voice agents that must handle the disfluencies of real human conversation and complete multi-step tasks.