Why is there still no realistic voice model despite huge advancements in AI?
OpenAI teased a hyper-realistic voice model years ago but hasn't shipped it.
The AI world has seen jaw-dropping progress in image generation (Midjourney, DALL-E 3) and video (Sora, Runway), but voice AI remains stubbornly robotic. OpenAI teased a strikingly realistic voice model years ago — one that could capture emotion, hesitation, and natural cadence — yet it has never been released. The current voice chat mode in ChatGPT is adequate for trivia and simple commands, but its flat tone, uneven pacing, and lack of expressiveness make it feel distinctly artificial during extended conversations. Users consistently report that even the best voice assistants fall into an uncanny valley, undermining trust and natural interaction.
Sesame AI has emerged as the strongest contender for voice realism, offering remarkably human-like tone and inflection. However, it is widely criticized as "low-IQ": it struggles with reasoning, context, and complex tasks. This trade-off highlights a core bottleneck: achieving natural voice while maintaining high intelligence appears to require immense compute and novel architectures, and no major player has closed the gap. The disparity between voice and other modalities suggests a fundamental technical hurdle — perhaps in prosody control, low-latency inference, or training data constraints. Until it is solved, voice AI will remain the weak link in the multimodal revolution.
- OpenAI showcased a hyper-realistic voice model years ago but never released it to the public.
- Current voice chat is robotic and lacks natural cadence for everyday conversations, limiting usability.
- Sesame AI leads in voice realism but is criticized for low intelligence, highlighting the realism-IQ trade-off.
Why It Matters
The lack of realistic voice AI holds back natural human-computer interaction and conversational AI adoption.