Amazon SageMaker AI and vLLM enable real-time speech-to-text streaming
Bidirectional streaming cuts latency for voice agents and live captioning.
Amazon SageMaker AI offers bidirectional streaming for real-time inference, a capability launched in November 2025 that enables continuous data flow between clients and model containers over HTTP/2. vLLM complements this with its Realtime API, which uses WebSockets for bidirectional streaming and supports piecewise CUDA graph execution to reduce GPU kernel launch overhead, lowering per-token latency during streaming transcription. Together, they provide a fully managed path to deploy Mistral AI's Voxtral-Mini-4B-Realtime-2602, a compact real-time speech model, as a speech-to-text service.
Key features include native WebSocket endpoints at /v1/realtime, automatic protocol translation between HTTP/2 on the client side and WebSocket on the container side, and audio input sent as base64-encoded PCM16 chunks. SageMaker handles connection management with ping/pong keepalives, health checks, and CloudWatch monitoring. Developers therefore do not need to build custom streaming infrastructure or manage GPU servers, and can go from a Hugging Face model to a production-ready real-time transcription service for voice agents, live captioning, contact center analytics, and accessibility tools.
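To make the "base64 PCM16 chunks" format concrete, here is a minimal Python sketch of how a client might prepare audio before streaming it. It only covers the encoding step. The 16 kHz sample rate, the 100 ms chunk length, and the helper names `float_to_pcm16` and `base64_chunks` are illustrative assumptions, not values or functions taken from SageMaker or vLLM, and the message envelope each chunk travels in is deliberately left out.

```python
import base64
import struct

SAMPLE_RATE_HZ = 16_000  # assumption: mono audio at 16 kHz
CHUNK_MS = 100           # assumption: 100 ms of audio per chunk


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    # Clip first so out-of-range samples cannot overflow the 16-bit range.
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def base64_chunks(pcm_bytes, sample_rate_hz=SAMPLE_RATE_HZ, chunk_ms=CHUNK_MS):
    """Yield base64 strings, each holding up to chunk_ms of PCM16 audio."""
    # Two bytes per sample for 16-bit mono audio.
    bytes_per_chunk = sample_rate_hz * chunk_ms // 1000 * 2
    for start in range(0, len(pcm_bytes), bytes_per_chunk):
        chunk = pcm_bytes[start:start + bytes_per_chunk]
        yield base64.b64encode(chunk).decode("ascii")


if __name__ == "__main__":
    # A quarter second of silence stands in for microphone input.
    pcm = float_to_pcm16([0.0] * 4000)
    for i, encoded in enumerate(base64_chunks(pcm)):
        print(i, len(encoded))
```

Each yielded string would then be placed in whatever message format the endpoint expects and sent over the open stream. Smaller chunks generally reduce the delay before the first transcribed words arrive, at the cost of more messages per second.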
- Bidirectional streaming in SageMaker AI launched in November 2025, using HTTP/2 for full-duplex client-server communication.
- vLLM's Realtime API (WebSocket-based) serves Voxtral-Mini-4B-Realtime-2602 with piecewise CUDA graphs for lower per-token latency.
- Protocol bridging between HTTP/2 and WebSocket is handled transparently by SageMaker, requiring no custom translation layer.
Why It Matters
Reduces latency and infrastructure complexity for real-time voice AI apps like voice agents and live captioning.