SANA-Streaming achieves real-time 24 FPS video editing on RTX 5090
Edit 1280x704 video in real time on a single consumer GPU at 24 FPS.
SANA-Streaming is a new framework for real-time streaming video-to-video editing, designed to address two challenges on consumer hardware: temporal consistency and inference throughput. The system combines three core innovations. First, a Hybrid Diffusion Transformer uses softmax attention in select blocks to strengthen local modeling, while the remaining blocks keep efficient linear layers. Second, Cycle-Reverse Regularization is a training strategy that predicts source frames from generated content via flow matching, enforcing semantic consistency without requiring paired long videos. Third, a system co-design fuses GDN kernels with mixed-precision quantization (MPQ) optimized for the NVIDIA Blackwell architecture (RTX 5090). The result is a system that edits 1280x704 video in real time at 24 FPS end to end, with the DiT core alone running at 58 FPS.
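To illustrate why mixing the two layer types can pay off, here is a minimal pure-Python sketch contrasting softmax attention with a linear-attention layer. It is not the SANA-Streaming implementation. It swaps in plain kernelized linear attention with an `elu(x) + 1` feature map instead of the GDN layers mentioned above, and `hybrid_layout` with its `softmax_every` parameter is a hypothetical placement rule, since the summary does not say which blocks use softmax attention.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def feature_map(m):
    """Positive feature map phi(x) = elu(x) + 1 (an assumed, commonly used choice)."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]

def softmax_attention(q, k, v):
    """Standard attention: builds an N x N score matrix, so cost grows with N^2."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for row in matmul(q, transpose(k)):
        peak = max(s * scale for s in row)          # subtract max for stability
        w = [math.exp(s * scale - peak) for s in row]
        z = sum(w)
        out.append([sum(wi / z * vj[c] for wi, vj in zip(w, v))
                    for c in range(len(v[0]))])
    return out

def linear_attention(q, k, v):
    """Kernelized attention as phi(Q) (phi(K)^T V); no N x N matrix is formed."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)            # d x d_v summary, independent of N
    k_sum = [sum(col) for col in zip(*fk)]   # normalizer vector of length d
    out = []
    for row in fq:
        denom = sum(a * b for a, b in zip(row, k_sum))
        out.append([n / denom for n in matmul([row], kv)[0]])
    return out

def linear_attention_quadratic(q, k, v):
    """Same result computed the slow way, (phi(Q) phi(K)^T) V, as a check."""
    scores = matmul(feature_map(q), transpose(feature_map(k)))
    return [[sum(s / sum(row) * vj[c] for s, vj in zip(row, v))
             for c in range(len(v[0]))] for row in scores]

def hybrid_layout(num_blocks, softmax_every):
    """Hypothetical layout: every `softmax_every`-th block uses softmax attention."""
    return ["softmax" if (i + 1) % softmax_every == 0 else "linear"
            for i in range(num_blocks)]
```

Calling `linear_attention` and `linear_attention_quadratic` on the same inputs gives matching outputs up to floating-point error. The linear form gets there by reordering the matrix products, so its cost grows linearly with the number of tokens, whereas the softmax form grows quadratically.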
According to the reported experiments, SANA-Streaming significantly outperforms existing state-of-the-art methods in both temporal coherence and system throughput. The hybrid architecture balances the detail of local softmax attention against the speed of linear layers, while Cycle-Reverse Regularization keeps generated frames consistent with the original source over time. The MPQ co-design is built to maximize Tensor Core utilization on RTX 5090 GPUs, enabling high-resolution, real-time editing that previously required expensive server-grade hardware. The work indicates that live video editing for applications such as broadcasting and gaming is now feasible on a single consumer GPU.
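As a rough illustration of the flow-matching idea behind Cycle-Reverse Regularization, the sketch below builds a training target for recovering a source frame while conditioning on the edited output. Everything here is an assumption made for illustration: the straight-line path from noise to the source latent, the `model(x_t, t, condition)` signature, and the unweighted mean-squared-error loss. None of these come from the summary, which does not specify the path, conditioning, or loss weighting.

```python
import random

def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, plus its velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]  # constant along a linear path
    return x_t, velocity

def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def cycle_reverse_loss(model, source_latent, edited_latent, rng):
    """Hypothetical reverse-direction loss: regress the velocity that carries
    noise to the SOURCE latent, conditioned on the EDITED latent."""
    noise = [rng.gauss(0.0, 1.0) for _ in source_latent]
    t = rng.random()                              # t in [0, 1)
    x_t, target_v = flow_matching_pair(noise, source_latent, t)
    pred_v = model(x_t, t, edited_latent)         # assumed model signature
    return mse(pred_v, target_v)
```

The point of such a term is that the edited frame must retain enough information to reconstruct its source, which pushes the generator toward source-consistent edits without needing paired long videos.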
- Hybrid Diffusion Transformer uses softmax attention in some blocks for local modeling while retaining efficient linear layers
- Cycle-Reverse Regularization enforces consistency with the source by predicting source frames from generated content via flow matching
- Achieves 24 FPS end-to-end at 1280x704 resolution on a single RTX 5090 GPU, with the DiT core at 58 FPS
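The two throughput figures can be turned into a per-frame time budget. The short calculation below does that under one loud assumption: that the stages run one after another for each frame, so the DiT time is a plain slice of the end-to-end time. If the real pipeline overlaps stages, the split will differ. The `frame_budget` helper and its field names are illustrative, not from the source.

```python
def frame_time_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

def frame_budget(end_to_end_fps, core_fps):
    """Split the per-frame budget into the DiT core and everything else,
    assuming strictly sequential stages (an assumption, not a reported fact)."""
    total = frame_time_ms(end_to_end_fps)
    core = frame_time_ms(core_fps)
    return {
        "end_to_end_ms": total,
        "dit_ms": core,
        "other_ms": total - core,     # all non-DiT work per frame
        "dit_share": core / total,
    }

budget = frame_budget(24, 58)
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```

At 24 FPS each frame has about 41.7 ms, and a DiT running at 58 FPS uses about 17.2 ms of it. Under the sequential assumption, that leaves roughly 24.4 ms per frame for everything outside the DiT core.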
Why It Matters
Brings real-time, high-resolution video editing to consumer GPUs, enabling live broadcasting and gaming applications.