Audio & Speech

SwanSphere generates streaming spatial audio from video and text

Real-time spatial audio from panoramic video with a diffusion transformer

Deep Dive

Ke Lei and co-authors (affiliation not stated) propose SwanSphere, a unified streaming framework for generating high-fidelity spatial audio from panoramic video and text prompts. The system targets two challenges in existing spatial audio synthesis: the trade-off between quality and latency, and the difficulty of extracting precise spatial information from multimodal inputs. SwanSphere introduces a causal autoregressive diffusion transformer architecture designed to generate audio in real time as a stream without sacrificing audio quality.
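To make "causal" and "streaming" concrete, here is a minimal sketch of a block-causal attention mask, one common way to let a transformer emit output chunk by chunk. It is an illustration only: the summary does not describe SwanSphere's actual masking scheme, and the function name `block_causal_mask` and the `chunk_size` parameter are assumptions made for this example.

```python
def block_causal_mask(num_frames: int, chunk_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when frame i may attend to frame j.

    Hypothetical illustration: frames are grouped into fixed-size chunks.
    A frame sees every frame in its own chunk and in earlier chunks,
    but nothing from future chunks, so a chunk can be generated as soon
    as its inputs have arrived.
    """
    if num_frames < 0 or chunk_size <= 0:
        raise ValueError("num_frames must be >= 0 and chunk_size must be > 0")
    mask = []
    for i in range(num_frames):
        chunk_i = i // chunk_size
        # Allow attention only to frames whose chunk index is not in the future.
        mask.append([(j // chunk_size) <= chunk_i for j in range(num_frames)])
    return mask
```

With six frames in chunks of two, frames 0 and 1 attend only to each other, frames 2 and 3 attend to the first four frames, and frames 4 and 5 attend to all six. In a streaming setting of this kind, latency is bounded by the chunk length rather than the full clip length.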

To align the visual and acoustic domains, the team designs a Spatial Video-Audio Contrastive (SVAC) learning strategy, further enhanced by a multi-objective online direct preference optimization (ODPO) scheme that improves spatial perception and robustness. To address the scarcity of spatial audio datasets, they also develop an automated annotation pipeline that generates detailed spatial captions. The authors report superior performance on both video-to-spatial and text-to-spatial audio generation tasks. The paper has been accepted at ICML 2026, and demos are available online.
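For readers unfamiliar with contrastive alignment, the sketch below shows a generic symmetric InfoNCE loss between paired video and audio embeddings. It is not the SVAC objective itself, whose exact form is not given in this summary; the function names, the `temperature` value, and the use of cosine similarity are assumptions chosen for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(video_emb, audio_emb, temperature=0.07):
    """Generic contrastive loss: pair i of video/audio is the positive,
    every other pairing in the batch is a negative."""
    if len(video_emb) != len(audio_emb) or not video_emb:
        raise ValueError("need the same non-zero number of video and audio embeddings")
    v = [_normalize(x) for x in video_emb]
    a = [_normalize(x) for x in audio_emb]
    logits = [
        [sum(p * q for p, q in zip(vi, aj)) / temperature for aj in a]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]  # audio-to-video direction
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))
```

The loss is small when each video embedding is closest to its own audio clip and large when pairs are mismatched, which is the pressure that pulls the two modalities into a shared space.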

Key Points
  • SwanSphere uses a causal autoregressive diffusion transformer for streaming spatial audio generation with low latency.
  • Spatial Video-Audio Contrastive (SVAC) learning aligns video encoders with acoustic domains for better spatial perception.
  • Multi-objective online direct preference optimization (ODPO) improves spatial perception and robustness, while an automated caption pipeline addresses the scarcity of spatial audio data.
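As background for the preference-optimization point above, this is a minimal sketch of the standard DPO loss for a single preference pair. It does not reproduce the paper's multi-objective online variant, which the summary does not specify; the argument names and the default `beta` are assumptions for this example.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are log-likelihoods under the model being trained and under a
    frozen reference model. The loss falls as the trained model raises the
    chosen sample's likelihood relative to the rejected one, measured
    against the reference.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the trained model and the reference agree, the loss equals log 2; it drops below that once the model prefers the chosen sample more strongly than the reference does. An "online" scheme would typically sample the preference pairs from the current model during training, though how SwanSphere combines its multiple objectives is not stated here.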

Why It Matters

SwanSphere enables real-time, high-quality spatial audio generation from video, a capability critical for VR/AR, live streaming, and immersive media applications.