Research & Papers

I can't believe text normalization is so underdiscussed in streaming text-to-speech [D]

New benchmark tests 1,000+ sentences to expose how TTS models fail on dates, URLs, and phone numbers.

Deep Dive

A new benchmark from Async Voice AI puts a spotlight on a critical but often overlooked weakness in real-time streaming Text-to-Speech (TTS) models: text normalization, the step that turns written tokens into the words to be spoken. Users typically judge TTS on voice quality and naturalness, yet models frequently mispronounce basic structured data such as prices, dates, URLs, promo codes, and phone numbers. The benchmark tests over 1,000 sentences across 31 specific categories to measure how well commercial models handle these inputs, using Google's Gemini to evaluate the output.

The findings validate a significant pain point for developers in production environments. A model might sound human-like reading a paragraph yet fail to verbalize a date like "03/05/2024" or a URL like "https://example.com". The benchmark, while acknowledged as a vendor-led study, provides a much-needed framework for comparing models on practical accuracy, not just subjective voice appeal. This moves the conversation beyond marketing specs to the details of real-world deployment, where reliable pronunciation of data is non-negotiable.
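To make the date example concrete, here is a minimal sketch of what normalizing "03/05/2024" involves. It is not taken from the benchmark or from any vendor's engine; the function names, the `day_first` flag, and the chosen spoken style ("March fifth, twenty twenty-four") are all assumptions made for illustration. It shows that the same written string has two valid readings depending on locale, which is one reason this category is hard.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
# Ordinals that are not simply the cardinal plus "th".
IRREGULAR = {1: "first", 2: "second", 3: "third", 5: "fifth", 8: "eighth",
             9: "ninth", 12: "twelfth", 20: "twentieth", 30: "thirtieth"}


def two_digits(n: int) -> str:
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def ordinal(n: int) -> str:
    """Spell out a day of the month, 1..31."""
    if n in IRREGULAR:
        return IRREGULAR[n]
    if n < 20:
        return ONES[n] + "th"
    tens, ones = divmod(n, 10)
    return TENS[tens] + "-" + ordinal(ones)


def spoken_year(year: int) -> str:
    """Spell out a four-digit year in one common English style (an assumption)."""
    hi, lo = divmod(year, 100)
    if hi % 10 == 0 and lo < 10:          # 2000..2009 -> "two thousand five"
        return ONES[hi // 10] + " thousand" + (" " + ONES[lo] if lo else "")
    if lo == 0:                           # 1900 -> "nineteen hundred"
        return two_digits(hi) + " hundred"
    if lo < 10:                           # 1905 -> "nineteen oh five"
        return two_digits(hi) + " oh " + ONES[lo]
    return two_digits(hi) + " " + two_digits(lo)


def verbalize_date(text: str, day_first: bool = False) -> str:
    """Turn 'NN/NN/YYYY' into words; day_first selects the locale reading."""
    match = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", text)
    if not match:
        raise ValueError(f"not a numeric date: {text!r}")
    a, b, year = map(int, match.groups())
    day, month = (a, b) if day_first else (b, a)
    if not (1 <= month <= 12 and 1 <= day <= 31):
        raise ValueError(f"out-of-range date: {text!r}")
    return f"{MONTHS[month - 1]} {ordinal(day)}, {spoken_year(year)}"


print(verbalize_date("03/05/2024"))                  # March fifth, twenty twenty-four
print(verbalize_date("03/05/2024", day_first=True))  # May third, twenty twenty-four
```

Both outputs are defensible, so a normalizer has to be told, or has to infer, which convention applies before any audio is generated.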

For engineering teams, this benchmark offers a concrete starting point to evaluate TTS solutions for applications like customer service IVRs, audiobooks with numerical data, or navigation systems. It underscores that choosing a TTS model requires testing against your specific use case's data formats. The discussion it has sparked highlights a gap in the industry's focus, pushing vendors to improve core normalization engines alongside voice realism.
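One way a team might start testing against its own data formats is a small per-category check like the sketch below. It is not the benchmark's harness: the three cases, their expected spoken forms, and the `spoken_form` callable (standing in for whatever tells you what the model actually said, such as a human transcript or a speech recognizer) are hypothetical. The benchmark itself uses Gemini as the judge, whereas this sketch uses plain string comparison.

```python
import re
from collections import defaultdict

# (category, input text, expected spoken form). The expected strings are
# illustrative assumptions; pick the readings your own product requires.
CASES = [
    ("date", "Your appointment is on 03/05/2024.",
     "your appointment is on march fifth twenty twenty-four"),
    ("url", "Visit https://example.com for details.",
     "visit example dot com for details"),
    ("phone", "Call 555-0100 now.",
     "call five five five zero one zero zero now"),
]


def canonical(text: str) -> str:
    """Lowercase, drop punctuation except hyphens, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s-]", "", text.lower())
    return " ".join(cleaned.split())


def score_by_category(cases, spoken_form):
    """Return {category: fraction of cases spoken exactly as expected}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, text, expected in cases:
        totals[category] += 1
        if canonical(spoken_form(text)) == canonical(expected):
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# A stand-in "model" that reads the raw text back without normalizing it.
print(score_by_category(CASES, lambda text: text))
```

Exact matching is deliberately strict and will penalize acceptable variants (for example "May third" under a day-first locale), so a real evaluation would need a list of accepted readings per case or a more lenient judge.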

Key Points
  • Benchmark tests 1,000+ sentences across 31 categories like dates, URLs, and phone numbers.
  • Uses Google's Gemini AI to evaluate pronunciation accuracy in streaming TTS models.
  • Reveals a major production gap where models with good voice quality mispronounce basic structured data.

Why It Matters

For developers, reliable text normalization is essential for professional TTS applications in customer service, accessibility, and navigation.