Stream Vision Agents + Amazon Nova 2 Sonic enable real-time voice agents in minutes
Speech-to-speech model meets open-source framework for sub-500ms join times.
Building production-grade voice AI that feels natural requires orchestrating speech recognition, language models, and text-to-speech within hundreds of milliseconds. Stream's new integration combines its open-source Vision Agents framework with Amazon Nova 2 Sonic, a speech-to-speech foundation model available through Amazon Bedrock. Nova 2 Sonic accepts audio input and produces audio output directly, eliminating the need for separate speech-to-text (STT) and text-to-speech (TTS) services. It also provides real-time bidirectional audio streaming, native turn detection, and function calling.
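To make the latency argument concrete, the sketch below adds up per-stage delays for a cascaded pipeline and for a single speech-to-speech stage, then checks each total against a response budget. Every millisecond value here is a hypothetical placeholder chosen for illustration, not a measurement of Nova 2 Sonic or any other service. The only idea taken from the text is that stages which run one after another each add to the total.

```python
# Illustrative latency-budget arithmetic.
# All millisecond values are hypothetical placeholders, not measurements.

def total_latency_ms(stages: dict[str, int]) -> int:
    """Sum per-stage latencies for stages that run one after another."""
    return sum(stages.values())


def fits_budget(stages: dict[str, int], budget_ms: int) -> bool:
    """Return True when the summed stage latency is within the budget."""
    return total_latency_ms(stages) <= budget_ms


# Hypothetical cascaded pipeline: separate STT, LLM, and TTS services,
# plus the network hops between them.
cascaded = {"stt": 200, "llm": 300, "tts": 150, "network_hops": 90}

# Hypothetical speech-to-speech setup: one model call, one network hop.
speech_to_speech = {"model": 350, "network_hops": 30}

BUDGET_MS = 500  # hypothetical target for a reply that feels conversational

if __name__ == "__main__":
    for name, stages in [("cascaded", cascaded), ("speech-to-speech", speech_to_speech)]:
        total = total_latency_ms(stages)
        print(f"{name}: {total} ms, within budget: {fits_budget(stages, BUDGET_MS)}")
```

Substituting measured values for the placeholders turns this into a quick check of whether a given architecture leaves enough headroom for the model itself.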
Vision Agents is a plugin-based Python framework with 25+ integrations and client SDKs for React, iOS, Android, Flutter, and React Native. It abstracts away infrastructure complexity such as WebRTC connection management, automatic reconnection, and graceful degradation. Combined with Stream's globally distributed edge network (sub-500ms join times, under 30ms audio latency), it lets developers build and deploy voice agents within minutes. The architecture keeps sensitive data in the customer's AWS account while Stream handles media transport.
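For a sense of what "automatic reconnection" involves when a framework does not handle it, here is a minimal sketch of retrying a connection with capped exponential backoff and jitter. It is a generic illustration of the technique, not Vision Agents' implementation or API: `connect_with_backoff`, its parameters, and the `connect` callable are all hypothetical names introduced for this example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds, waiting longer after each failure.

    `connect` is a hypothetical callable that raises ConnectionError on failure.
    `sleep` is injectable so the function can be tested without real delays.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Double the ceiling on each failure, cap it, then pick a random
            # wait below the ceiling so many clients do not retry in lockstep.
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

A caller would pass in whatever function opens its session, for example `connect_with_backoff(open_session)`, where `open_session` is likewise a stand-in name. Handling this inside the framework spares application code from writing and tuning this kind of retry loop.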
- Amazon Nova 2 Sonic handles the full speech-to-speech pipeline, with no separate STT/TTS services needed
- Stream Vision Agents provides an open-source framework with 25+ integrations and client SDKs for major platforms
- Edge network delivers sub-500ms join times and under 30ms audio latency for natural conversation flow
Why It Matters
Teams can now take voice AI apps from concept to production in minutes without building custom real-time infrastructure.