Audio & Speech

Audio2Tool: Bridging Spoken Language Understanding and Function Calling

New dataset reveals voice assistants fail at complex tasks in noisy environments

Deep Dive

An academic research team has unveiled Audio2Tool, a large-scale benchmark of approximately 30,000 spoken queries designed to rigorously test the tool-calling capabilities of Speech Language Models (SpeechLMs) across three domains: Smart Car, Smart Home, and Wearables. The benchmark introduces a multi-tier complexity hierarchy, ranging from simple direct commands to multi-intent and needle-in-a-haystack extraction tasks, aimed at isolating distinct failure modes in voice assistant performance. To ensure realism, the team used zero-shot voice-cloning text-to-speech (TTS) synthesis and diverse noise profiles to simulate in-the-wild acoustic conditions.
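
The paper's augmentation pipeline is not detailed in this digest, but SNR-controlled noise mixing of the kind the authors describe is a standard recipe. The sketch below is a minimal illustration under that assumption, not the benchmark's actual code; the function name, parameters, and the synthetic stand-in for TTS output are all hypothetical.

```python
# Minimal sketch of SNR-controlled noise mixing for simulating
# in-the-wild acoustic conditions. Names and parameters are
# illustrative assumptions, not taken from the Audio2Tool paper.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay `noise` onto `speech` at a target signal-to-noise ratio (dB)."""
    # Loop or trim the noise so it covers the full utterance.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    # Scale the noise so that 10 * log10(P_speech / P_noise) == snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silence
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise

    # Normalize only if the mix would clip when written as audio.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

# Example: corrupt a synthetic 1-second utterance with white noise at 5 dB SNR.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))  # stand-in for TTS output
noisy = mix_at_snr(clean, rng.normal(size=8000), snr_db=5.0)
```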

Evaluations of state-of-the-art SpeechLMs and ASR-LLM pipelines reveal strong performance on simple commands but significant degradation under compositional and acoustic challenges, highlighting a critical gap in current voice AI systems' ability to handle complex, real-world queries. The dataset, which addresses gaps in existing benchmarks' domain breadth, acoustic diversity, and compositional reasoning, will be released upon paper acceptance, providing a valuable resource for improving voice assistant reliability and functionality in practical applications.
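
The digest does not spell out the paper's scoring metric, but tool-calling benchmarks are commonly scored by exact match on the predicted function name and arguments, with multi-intent queries requiring every gold call to be recovered. The sketch below illustrates that common scheme under those assumptions; the call schema and the smart-car function names are hypothetical.

```python
# Minimal sketch of exact-match tool-call scoring. The {"name", "arguments"}
# schema and the example functions are assumptions for illustration only.

def call_matches(pred: dict, gold: dict) -> bool:
    """Exact-match check on tool name and arguments."""
    return (pred.get("name") == gold.get("name")
            and pred.get("arguments") == gold.get("arguments"))

def score(predictions: list[list[dict]], references: list[list[dict]]) -> float:
    """Fraction of queries answered correctly. A query counts only if every
    gold call is produced (order-insensitive) with no extra calls, which is
    what makes multi-intent queries harder than single commands."""
    correct = 0
    for preds, golds in zip(predictions, references):
        all_found = all(any(call_matches(p, g) for p in preds) for g in golds)
        if all_found and len(preds) == len(golds):
            correct += 1
    return correct / len(references)

# Example: one single-intent query and one multi-intent query.
refs = [
    [{"name": "set_temperature", "arguments": {"zone": "driver", "celsius": 21}}],
    [{"name": "lock_doors", "arguments": {}},
     {"name": "set_volume", "arguments": {"level": 3}}],
]
preds = [
    [{"name": "set_temperature", "arguments": {"zone": "driver", "celsius": 21}}],
    [{"name": "lock_doors", "arguments": {}}],  # missed the second intent
]
print(score(preds, refs))  # 0.5
```

Under this kind of metric, a model can look strong on simple commands yet lose most of its score on multi-intent queries by dropping even one constituent call, which is consistent with the degradation pattern the evaluations report.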

Key Points
  • Audio2Tool includes approximately 30,000 spoken queries across Smart Car, Smart Home, and Wearables domains
  • Benchmark uses multi-tier complexity: simple commands, multi-intent, and needle-in-a-haystack tasks
  • Realism achieved via zero-shot voice cloning and diverse noise profiles; models degrade significantly on complex tasks

Why It Matters

This benchmark exposes critical weaknesses in voice AI under compositional and noisy conditions, driving improvements toward reliable, real-world assistant interactions.