A New Framework for Evaluating Voice Agents (EVA)
First framework to jointly score task success and conversational quality in multi-turn voice interactions.
ServiceNow AI researchers have introduced EVA (Evaluating Voice Agents), the first comprehensive framework to assess conversational voice agents jointly on task accuracy and conversational experience. Unlike existing benchmarks that evaluate components in isolation—such as AudioBench for speech understanding or FD-Bench for conversational dynamics—EVA simulates complete, multi-turn spoken conversations over live audio. The framework uses a realistic bot-to-bot architecture in which agents must invoke tools, follow policies, and reach verifiable end states across scenarios like flight rebooking and cancellations. It outputs two distinct scores: EVA-A for Accuracy (task completion) and EVA-X for Experience (naturalness, conciseness, appropriateness).
EVA launches with an initial dataset of 50 airline scenarios and benchmark results for 20 systems, spanning cascade architectures and audio-native models such as speech-to-speech systems and large audio language models. The most significant finding is a consistent Accuracy-Experience tradeoff: agents that score well on task completion tend to deliver poorer conversational experiences, and vice versa. Because EVA simulates full conversations rather than isolated components, it also surfaces interaction dynamics that component-level testing misses, such as whether agents interrupt during natural pauses or recover smoothly from transcription errors. The framework is publicly available through a dedicated website, GitHub repository, and Hugging Face dataset, with plans to expand to additional domains beyond aviation.
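To make the dual-score output concrete, here is a minimal sketch of how per-scenario results might be recorded and aggregated into the two headline metrics. The class and field names (`ScenarioResult`, `eva_a`, `eva_x`) are illustrative assumptions, not the actual EVA codebase or API; the point is simply that task success and experience are kept as separate axes rather than collapsed into one number.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class ScenarioResult:
    """One simulated multi-turn conversation, scored on both axes."""
    scenario_id: str  # e.g. an airline rebooking or cancellation case
    eva_a: float      # Accuracy: did the agent reach the verifiable end state? (0-1)
    eva_x: float      # Experience: naturalness, conciseness, appropriateness (0-1)

def summarize(results: list[ScenarioResult]) -> dict[str, float]:
    """Aggregate per-scenario scores into separate EVA-A and EVA-X metrics."""
    return {
        "EVA-A": mean(r.eva_a for r in results),
        "EVA-X": mean(r.eva_x for r in results),
    }

# Hypothetical results illustrating the Accuracy-Experience tradeoff:
# one agent completes the task but converses poorly, the other the reverse.
results = [
    ScenarioResult("rebooking-01", eva_a=1.0, eva_x=0.6),
    ScenarioResult("cancellation-02", eva_a=0.0, eva_x=0.9),
]
print(summarize(results))
```

Keeping the two scores separate is what lets the benchmark expose the tradeoff at all; a single blended metric would hide an agent that wins on one axis by sacrificing the other.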
- First framework to jointly evaluate task accuracy (EVA-A) and conversational experience (EVA-X) in voice agents
- Uses bot-to-bot testing with 50 initial airline scenarios covering flight rebooking, cancellations, and vouchers
- Reveals a consistent tradeoff: agents that excel at task completion tend to deliver worse conversational experiences, and vice versa
Why It Matters
Provides enterprises with standardized metrics to evaluate real-world voice agent performance beyond basic accuracy, impacting customer service quality and deployment decisions.