Viral Wire

OpenAI Unveils New Real-Time Voice APIs: GPT-Realtime-2, -Translate, and -Whisper

Three new models: Realtime-2 with 128K context, Translate, and Whisper.

Deep Dive

OpenAI unveiled three new realtime voice models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. GPT-Realtime-2 is the flagship, bringing 'GPT-5-class reasoning' to real-time voice agents with a 128K context window (up from 32K) and a 32K maximum output. Developers can adjust reasoning effort from minimal to xhigh, enable preambles like 'let me check that,' and use parallel tool calls with audible transparency (e.g., 'checking your calendar'). The model also features improved interruption recovery and more controllable tone, speaking calmly, empathetically, or upbeat depending on context. Scale AI reported that it took the top spot on its Audio MultiChallenge S2S leaderboard, with a 15.2% improvement on Big Bench Audio.
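As a rough illustration of how a developer might select these settings, here is a minimal sketch that builds a Realtime API `session.update` event. The model name and the minimal-to-xhigh effort scale come from the announcement; the `reasoning` and `preambles` field names are assumptions for illustration, not a confirmed schema.

```python
import json

# Sketch only: field names beyond "model" are assumptions, not the
# documented Realtime API session schema.
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

def build_session_update(model: str,
                         reasoning_effort: str = "minimal",
                         preambles: bool = True) -> dict:
    """Build a hypothetical session.update event for a voice session."""
    if reasoning_effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown reasoning effort: {reasoning_effort}")
    return {
        "type": "session.update",
        "session": {
            "model": model,                              # per the article
            "reasoning": {"effort": reasoning_effort},   # field name assumed
            "preambles": {"enabled": preambles},         # field name assumed
        },
    }

event = build_session_update("gpt-realtime-2", reasoning_effort="xhigh")
print(json.dumps(event, indent=2))
```

The event would be sent over the session's WebSocket connection; validating the effort level client-side simply fails fast before a round trip.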

The companion models target specific use cases. GPT-Realtime-Translate supports live streaming translation from 70+ input languages into 13 output languages, enabling real-time multilingual conversations. GPT-Realtime-Whisper provides low-latency streaming transcription for captions, notes, and continuous speech understanding. OpenAI also published a voice prompting guide covering reasoning effort, preambles, tool behavior, and state maintenance. All three models are available in the Realtime API now; ChatGPT voice upgrades are 'coming soon.' Sam Altman noted users are increasingly using voice to 'dump' lots of context, driving the need for smarter, more fluent voice AI.
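A translation session would presumably be configured the same way, pinning one of the 13 output languages per session. The sketch below assumes an `output_language` session field; only the model name comes from the announcement.

```python
import json

# Hypothetical sketch of configuring a GPT-Realtime-Translate session.
# The "output_language" field name is an assumption, not a documented API.
def build_translate_session(target_language: str) -> dict:
    """Build a session.update event that pins the translation output language."""
    if not target_language:
        raise ValueError("target_language is required")
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-translate",   # name per announcement
            "output_language": target_language,  # field name assumed
        },
    }

event = build_translate_session("es")
print(json.dumps(event, indent=2))
```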

Key Points
  • GPT-Realtime-2: 128K context, GPT-5-class reasoning, adjustable reasoning effort from minimal to xhigh.
  • GPT-Realtime-Translate: streaming translation from 70+ languages to 13 output languages in real time.
  • GPT-Realtime-Whisper: low-latency streaming transcription with continuous speech understanding.

Why It Matters

Voice agents get near-human fluency, interruption handling, and tool use for enterprise applications.