Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild
New benchmark reveals speech AI struggles with rare, context-specific words crucial for business use.
A team of researchers including Berkin Durmus and Chen Cen has published a paper introducing the Contextual Earnings-22 benchmark. The problem they address is that speech-to-text accuracy has plateaued on standard academic benchmarks, in contrast with the high-stakes demands of industrial applications. The researchers hypothesize that the key difference is 'contextual conditioning': the ability to accurately recognize rare, domain-specific vocabulary that is critical for usability, such as company names, product terms, or financial jargon in earnings calls. To fill the gap in standardized testing, they built this open dataset on the existing Earnings-22 corpus, adding realistic custom vocabulary contexts to foster research and reveal progress that standard benchmarks fail to capture.
In their experiments, the team established six strong baselines for the two dominant technical approaches to this problem: keyword prompting and keyword boosting. Their findings show that both methods reach comparable accuracy, and that this accuracy improves significantly when the methods are scaled from small proof-of-concept systems to large-scale implementations. The benchmark is designed to move beyond tests dominated by common vocabulary and instead evaluate how well speech recognition systems handle the specialized terms that have a disproportionate impact on real-world transcript utility. By providing a common ground for evaluation, Contextual Earnings-22 aims to steer research and development toward the practical challenges that currently limit speech AI adoption in professional, high-value scenarios.
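To see why a test dominated by common vocabulary can hide failures on specialized terms, consider the rough sketch below. It compares overall word error rate (WER) with a simple keyword recall score on a single sentence. The recall definition, the function names, the sample sentences, and the company and product names are all illustrative assumptions, not the metric or data used in the paper.

```python
from typing import List, Optional


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def keyword_recall(reference: str, hypothesis: str,
                   keywords: List[str]) -> Optional[float]:
    """Share of custom-vocabulary words spoken in the reference that also
    appear in the hypothesis. Single-word keywords only (a simplification).
    Returns None when no keyword occurs in the reference."""
    ref_words = set(reference.lower().split())
    hyp_words = set(hypothesis.lower().split())
    spoken = [k.lower() for k in keywords if k.lower() in ref_words]
    if not spoken:
        return None
    return sum(1 for k in spoken if k in hyp_words) / len(spoken)


# Hypothetical example: "Acmezon" and "Zyntrex" are invented names.
reference = "revenue at acmezon grew on strong zyntrex demand this quarter"
hypothesis = "revenue at amazon grew on strong syntax demand this quarter"
custom_vocabulary = ["Acmezon", "Zyntrex"]

wer = word_error_rate(reference, hypothesis)
recall = keyword_recall(reference, hypothesis, custom_vocabulary)
print(f"WER: {wer:.2f}, keyword recall: {recall:.2f}")
```

In this made-up case only two of ten words are wrong, so the WER looks modest, yet both custom-vocabulary terms are missed and the keyword recall is zero. That is the kind of gap between aggregate accuracy and transcript usefulness that a custom-vocabulary benchmark is meant to expose.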
- Introduces Contextual Earnings-22, an open benchmark built on the Earnings-22 dataset to test speech AI on custom, domain-specific vocabulary.
- Sets six baselines for keyword prompting and keyword boosting, showing both methods reach comparable, significantly improved accuracy when scaled from proof-of-concept to large-scale systems.
- Aims to bridge the gap between plateaued academic scores and the growing industrial need for reliable transcription in high-stakes business contexts.
Why It Matters
It pushes AI development toward solving real business problems, like accurately transcribing financial calls with complex jargon, where errors are costly.