A New Framework for Evaluating Voice Agents (EVA)
First framework to jointly score task success and conversational quality in multi-turn voice interactions.
ServiceNow AI researchers have introduced EVA (Evaluating Voice Agents), the first comprehensive framework to assess conversational voice agents jointly on task accuracy and conversational experience. Unlike existing benchmarks that evaluate components in isolation—such as AudioBench for speech understanding or FD-Bench for conversational dynamics—EVA simulates complete, multi-turn spoken conversations over live audio. The framework uses a realistic bot-to-bot architecture in which agents must invoke tools, follow policies, and reach verifiable end states across scenarios like flight rebooking and cancellations. It outputs two distinct scores: EVA-A for Accuracy (task completion) and EVA-X for Experience (naturalness, conciseness, appropriateness).
EVA launches with an initial dataset of 50 airline scenarios and benchmark results for 20 systems, spanning cascade architectures and audio-native models such as speech-to-speech systems and large audio language models. The most significant finding is a consistent Accuracy-Experience tradeoff: agents that score well on task completion tend to deliver poorer conversational experiences, and vice versa. Because EVA simulates full conversations rather than isolated components, it also surfaces interaction dynamics that component-level testing misses, such as whether agents interrupt during natural pauses or recover smoothly from transcription errors. The framework is publicly available through a dedicated website, GitHub repository, and Hugging Face dataset, with plans to expand to additional domains beyond aviation.
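To make the dual-score output concrete, here is a minimal sketch of how per-scenario results might be recorded and aggregated into the two headline metrics. The class and field names (`ScenarioResult`, `eva_a`, `eva_x`) are illustrative assumptions, not the actual EVA codebase or API; the point is simply that task success and experience are kept as separate axes rather than collapsed into one number.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class ScenarioResult:
    """One simulated multi-turn conversation, scored on both axes."""
    scenario_id: str  # e.g. an airline rebooking or cancellation case
    eva_a: float      # Accuracy: did the agent reach the verifiable end state? (0-1)
    eva_x: float      # Experience: naturalness, conciseness, appropriateness (0-1)

def summarize(results: list[ScenarioResult]) -> dict[str, float]:
    """Aggregate per-scenario scores into separate EVA-A and EVA-X metrics."""
    return {
        "EVA-A": mean(r.eva_a for r in results),
        "EVA-X": mean(r.eva_x for r in results),
    }

# Hypothetical results illustrating the Accuracy-Experience tradeoff:
# one agent completes the task but converses poorly, the other the reverse.
results = [
    ScenarioResult("rebooking-01", eva_a=1.0, eva_x=0.6),
    ScenarioResult("cancellation-02", eva_a=0.0, eva_x=0.9),
]
print(summarize(results))
```

Keeping the two scores separate is what lets the benchmark expose the tradeoff at all; a single blended metric would hide an agent that wins on one axis by sacrificing the other.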
- First framework to jointly evaluate task accuracy (EVA-A) and conversational experience (EVA-X) in voice agents
- Uses bot-to-bot testing with 50 initial airline scenarios covering flight rebooking, cancellations, and vouchers
- Reveals a consistent tradeoff: agents that excel at task completion tend to deliver worse conversational experiences, and vice versa
Why It Matters
Provides enterprises with standardized metrics to evaluate real-world voice agent performance beyond basic accuracy, impacting customer service quality and deployment decisions.