Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics
New semantic-aware evaluation shows LLM-based speech recognition degrades as speaker count increases beyond two.
A research team from Johns Hopkins University, Brno University of Technology, and Mitsubishi Electric has published a paper introducing tcpSemER, a novel evaluation metric designed to assess the real-world performance of conversational automatic speech recognition (ASR) systems. The metric addresses a critical flaw in established measures such as Word Error Rate (WER), which penalize synonyms and paraphrases as if they were outright errors. tcpSemER replaces the Levenshtein distance calculation with embedding-based semantic similarity, capturing whether the *meaning* of a conversation is preserved rather than just the exact words. The team also decomposed the conventional tcpWER metric to analyze errors in overlapping and non-overlapping speech segments separately.
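To make the core idea concrete, here is a minimal sketch of embedding-based semantic scoring. It assumes a generic sentence-embedding model from the `sentence-transformers` library; the paper's actual embedding model, alignment procedure, and time-constrained matching are not described here, so every name below is illustrative rather than the authors' implementation.

```python
# Illustrative only: score a hypothesis against a reference by embedding
# cosine similarity instead of word-level Levenshtein edits. The model
# choice and the 1 - cosine formulation are assumptions, not the paper's spec.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def semantic_error(reference: str, hypothesis: str) -> float:
    """0.0 = same meaning, 1.0 = unrelated; paraphrases score near 0."""
    ref_vec, hyp_vec = model.encode([reference, hypothesis])
    cos = float(np.dot(ref_vec, hyp_vec)
                / (np.linalg.norm(ref_vec) * np.linalg.norm(hyp_vec)))
    return 1.0 - max(cos, 0.0)

# A paraphrase that exact-match WER would heavily penalize:
print(semantic_error("let's meet at noon tomorrow",
                     "we will meet at 12 pm tomorrow"))   # low error
print(semantic_error("let's meet at noon tomorrow",
                     "the quarterly report is overdue"))  # high error
```

Under conventional WER, the paraphrase above would register substitution errors on nearly every word; a semantic metric instead credits the preserved meaning.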
In a systematic evaluation comparing end-to-end systems, such as OpenAI's Whisper and LLM-based spoken language models, against traditional modular pipelines, the researchers tested performance along four axes: overlap robustness, semantic fidelity, varying speaker counts, and single- versus multi-channel audio. The results, validated across three datasets, were revealing. While the end-to-end systems are competitive in clean, two-speaker settings, their performance degrades significantly as the number of speakers grows and speech overlap becomes more frequent. By contrast, the older modular pipeline architectures proved more robust in the complex, multi-speaker conditions common in real meetings.
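The overlap decomposition can likewise be sketched with a simple interval test over speaker-attributed, time-stamped segments. The segment layout and pairwise check below are assumptions chosen for illustration, not the authors' implementation.

```python
# Illustrative only: partition time-stamped, speaker-attributed segments into
# those that overlap speech from another speaker and those that do not, so
# the two pools can be scored separately. The data layout is an assumption.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float
    text: str

def split_by_overlap(segments):
    """Return (overlapped, clean) segment lists."""
    overlapped, clean = [], []
    for seg in segments:
        collides = any(
            other.speaker != seg.speaker
            and other.start < seg.end
            and seg.start < other.end
            for other in segments
        )
        (overlapped if collides else clean).append(seg)
    return overlapped, clean

turns = [
    Segment("A", 0.0, 2.5, "so about the budget"),
    Segment("B", 2.0, 4.0, "right, the budget"),  # overlaps speaker A
    Segment("A", 5.0, 6.5, "let's table it"),     # no overlap
]
overlapped, clean = split_by_overlap(turns)
print(len(overlapped), len(clean))  # 2 1
```

Scoring the two pools separately is what lets the evaluation attribute a system's degradation specifically to overlapping speech rather than to multi-speaker audio in general.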
This work highlights a significant gap between benchmark performance and real-world utility for AI-powered meeting transcription tools. The findings suggest that developers of conversational AI agents and transcription services cannot rely solely on single-speaker or clean-audio benchmarks. For applications like Zoom call summaries or multi-party interview transcripts, the choice between a sleek end-to-end LLM and a more complex modular system involves a clear trade-off between simplicity and robustness in noisy, overlapping scenarios.
- Introduced tcpSemER, a semantic-aware metric using embedding similarity to evaluate meaning preservation, not just word accuracy.
- Found LLM-based ASR systems degrade with >2 speakers and overlapping speech, while modular pipelines remain 40% more robust.
- Highlights a critical evaluation gap, showing systems like Whisper may fail in real meetings despite acing single-speaker benchmarks.
Why It Matters
This exposes a key weakness in AI meeting assistants, forcing developers to choose between simple LLMs and robust pipelines for real-world use.