
AWS SageMaker AI and vLLM enable real-time speech-to-text streaming

AWS Machine Learning Blog May 21, 2026

⚡ Bidirectional streaming cuts latency for voice agents and live captioning.

Deep Dive

Amazon SageMaker AI supports bidirectional streaming for real-time inference, a capability launched in November 2025 that keeps data flowing continuously between clients and model containers over HTTP/2. vLLM complements this with its Realtime API, which streams in both directions over WebSockets and supports piecewise CUDA graph execution. That execution mode reduces GPU kernel launch overhead, which lowers per-token latency during streaming transcription. Together, the two provide a fully managed path for deploying Mistral AI's Voxtral-Mini-4B-Realtime-2602, a compact real-time speech model, as a speech-to-text service.

Key features include native WebSocket endpoints at /v1/realtime, automatic protocol translation between HTTP/2 on the client side and WebSocket on the container side, and audio delivered as base64-encoded PCM16 chunks. SageMaker handles connection management with ping/pong keepalives, health checks, and CloudWatch monitoring. Developers therefore do not need to build custom streaming infrastructure or manage GPU servers, and can go from a Hugging Face model to a production-ready real-time transcription service for voice agents, live captioning, contact center analytics, and accessibility tools.
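To make the "base64 PCM16 chunks" idea concrete, here is a minimal Python sketch of how a client might prepare audio before sending it over the stream. It is an illustration under stated assumptions, not the article's code: the 16 kHz sample rate, the 100 ms chunk size, and the message shape (a JSON object with "type" and "audio" fields, modeled loosely on common realtime speech APIs) are all assumed here and should be checked against the vLLM Realtime API documentation.

```python
import base64
import json
import math
import struct

SAMPLE_RATE_HZ = 16_000  # assumption: not stated in the article
CHUNK_MS = 100           # assumption: illustrative chunk duration


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm16_to_messages(pcm_bytes, sample_rate=SAMPLE_RATE_HZ, chunk_ms=CHUNK_MS):
    """Yield one JSON message per fixed-duration chunk of PCM16 audio."""
    # Two bytes per 16-bit mono sample.
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2
    for start in range(0, len(pcm_bytes), bytes_per_chunk):
        chunk = pcm_bytes[start:start + bytes_per_chunk]
        yield json.dumps({
            # Hypothetical event name and field names, used for illustration.
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# Usage: a quarter second of a 440 Hz tone stands in for microphone input.
tone = [
    math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ)
    for n in range(SAMPLE_RATE_HZ // 4)
]
pcm = floats_to_pcm16(tone)
messages = list(pcm16_to_messages(pcm))
print(f"{len(pcm)} PCM bytes split into {len(messages)} messages")
```

Each message is plain text, so it can travel over any text-capable stream. In a real client these messages would be written to the open bidirectional connection as audio is captured, rather than collected in a list first.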

Key Points

Bidirectional streaming in SageMaker AI launched in November 2025 and uses HTTP/2 for full-duplex client-server communication.
vLLM's Realtime API (WebSocket-based) serves Voxtral-Mini-4B-Realtime-2602 with piecewise CUDA graphs for lower per-token latency.
Protocol bridging between HTTP/2 and WebSocket is handled transparently by SageMaker, requiring no custom translation layer.

Why It Matters

Reduces latency and infrastructure complexity for real-time voice AI apps like voice agents and live captioning.
