SwanSphere generates streaming spatial audio from video and text
Real-time spatial audio from panoramic video with a diffusion transformer
Ke Lei and colleagues (affiliation not stated) propose SwanSphere, a unified streaming framework that generates high-fidelity spatial audio from panoramic video and text prompts. The system targets two weaknesses of existing spatial audio synthesis: the trade-off between quality and latency, and the difficulty of extracting precise spatial information from multimodal inputs. SwanSphere introduces a causal autoregressive diffusion transformer that supports real-time streaming generation without sacrificing audio quality.
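To make "causal autoregressive" concrete, here is a minimal sketch of a block-causal attention mask, a common way to let a transformer generate audio chunk by chunk. The summary does not describe SwanSphere's actual masking scheme, so the function name `block_causal_mask` and the `chunk_size` parameter are illustrative assumptions, not the authors' design.

```python
def block_causal_mask(num_frames: int, chunk_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when frame i may attend to frame j.

    Assumption (not from the paper): frames are grouped into fixed-size
    chunks; a frame sees every frame in its own chunk and in all earlier
    chunks, but nothing from future chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask = []
    for i in range(num_frames):
        chunk_i = i // chunk_size  # index of the chunk that frame i belongs to
        mask.append([(j // chunk_size) <= chunk_i for j in range(num_frames)])
    return mask


# Six frames split into three chunks of two frames each.
mask = block_causal_mask(num_frames=6, chunk_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because no frame depends on future chunks, a model using a mask of this kind can emit each chunk as soon as the matching video frames arrive, which is the property that makes streaming possible.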
To align the visual and acoustic domains, the team designs a Spatial Video-Audio Contrastive (SVAC) learning strategy, then adds a multi-objective online direct preference optimization (ODPO) scheme to improve spatial perception and robustness. To address the scarcity of spatial audio datasets, they also build an automated annotation pipeline that generates detailed spatial captions. The authors report superior performance on both video-to-spatial-audio and text-to-spatial-audio generation. The paper has been accepted at ICML 2026, and demos are available online.
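For readers unfamiliar with contrastive alignment, the sketch below shows a generic symmetric InfoNCE loss over paired video and audio embeddings. It is a stand-in for the general idea only: the summary does not give the SVAC formulation, and the function name `symmetric_info_nce`, the `temperature` value, and the toy embeddings are assumptions.

```python
import math


def _normalize(vec):
    # Scale to unit length; assumes the vector is non-zero.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    # Mean over rows of -log softmax(row)[i], i.e. the i-th item is the target.
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(video_emb, audio_emb, temperature=0.07):
    """Generic contrastive loss: matching (video, audio) pairs share an index."""
    v = [_normalize(x) for x in video_emb]
    a = [_normalize(x) for x in audio_emb]
    # Cosine similarity between every video and every audio clip, scaled.
    logits = [[sum(p * q for p, q in zip(vi, aj)) / temperature for aj in a]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # audio-to-video direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


# Toy embeddings: aligned pairs versus deliberately swapped pairs.
video = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(video, [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce(video, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each video embedding is closest to its own audio embedding and large when pairs are mismatched, which is what pushes the two encoders toward a shared space.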
- SwanSphere uses a causal autoregressive diffusion transformer for streaming spatial audio generation with low latency.
- Spatial Video-Audio Contrastive (SVAC) learning aligns video encoder features with the acoustic domain for better spatial perception.
- Multi-objective online direct preference optimization (ODPO) and an automated caption pipeline enhance robustness and data availability.
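As background on the preference-optimization step, the following sketch shows the standard DPO loss for one preferred/rejected pair, plus a hypothetical way to choose that pair from several reward scores. The summary does not specify SwanSphere's objectives, weights, or its diffusion-specific formulation, so `pick_preference_pair`, the score names `spatial` and `quality`, and all numbers are assumptions for illustration.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(m)) equals softplus(-m); this form is numerically stable.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pick_preference_pair(candidates, weights):
    """Hypothetical multi-objective ranking: weighted sum of per-objective scores.

    Returns (index of best candidate, index of worst candidate).
    """
    totals = [sum(weights[k] * scores[k] for k in weights) for scores in candidates]
    best = max(range(len(totals)), key=totals.__getitem__)
    worst = min(range(len(totals)), key=totals.__getitem__)
    return best, worst


# Three sampled outputs scored on two assumed objectives.
candidates = [
    {"spatial": 0.9, "quality": 0.6},
    {"spatial": 0.4, "quality": 0.5},
    {"spatial": 0.7, "quality": 0.9},
]
best, worst = pick_preference_pair(candidates, {"spatial": 0.5, "quality": 0.5})
loss = dpo_loss(logp_win=-1.0, logp_lose=-3.0, ref_logp_win=-2.0, ref_logp_lose=-2.0)
print(best, worst, loss)
```

In an online setup the candidates are typically sampled from the current model during training rather than drawn from a fixed preference dataset, so the pairs track the model as it improves.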
Why It Matters
Enables real-time, high-quality spatial audio from video, critical for VR/AR, live streaming, and immersive media applications.