Audio & Speech

SwanSphere generates streaming spatial audio from video and text

arXiv eess.AS June 01, 2026

⚡Real-time spatial audio from panoramic video with a diffusion transformer

Deep Dive

Ke Lei and colleagues propose SwanSphere, a unified streaming framework for generating high-fidelity spatial audio from panoramic video and text prompts. The system targets two problems in existing spatial audio synthesis: the trade-off between quality and latency, and the difficulty of extracting precise spatial information from multimodal inputs. At its core is a causal autoregressive diffusion transformer, which the authors say enables real-time streaming generation without sacrificing audio quality.
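To make "causal" concrete for streaming: one common way to let a transformer emit audio chunk by chunk is a block-causal attention mask, in which each frame attends to its own chunk and to earlier chunks but never to future ones. The sketch below illustrates that general idea only. The summary does not describe SwanSphere's actual masking scheme, and the function name `block_causal_mask` and the `chunk_size` parameter are assumptions made for illustration.

```python
def block_causal_mask(num_frames: int, chunk_size: int) -> list[list[bool]]:
    """Build a block-causal attention mask (illustrative, not the paper's code).

    mask[i][j] is True when frame i is allowed to attend to frame j, i.e.
    when frame j lies in the same chunk as frame i or in an earlier chunk.
    """
    if num_frames < 0 or chunk_size <= 0:
        raise ValueError("num_frames must be >= 0 and chunk_size must be > 0")
    mask = []
    for i in range(num_frames):
        chunk_i = i // chunk_size  # index of the chunk that frame i belongs to
        mask.append([(j // chunk_size) <= chunk_i for j in range(num_frames)])
    return mask


if __name__ == "__main__":
    # Four frames split into chunks of two: the first chunk cannot see the second.
    for row in block_causal_mask(4, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Because no frame depends on a later chunk, a model using such a mask can start emitting the first chunk before the rest of the sequence exists, which is the property a low-latency streaming generator needs.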

To align the visual and acoustic domains, the team designs a Spatial Video-Audio Contrastive (SVAC) learning strategy. A multi-objective online direct preference optimization (ODPO) scheme then further improves spatial perception and robustness. To address the scarcity of spatial audio datasets, the authors also build an automated annotation pipeline that generates detailed spatial captions. They report superior performance on both video-to-spatial and text-to-spatial audio generation tasks. The paper has been accepted at ICML 2026, and demos are available online.
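Contrastive alignment between two modalities is often implemented with a symmetric InfoNCE loss: matching video and audio embeddings are pulled together while mismatched pairs in the same batch are pushed apart. The sketch below shows that generic loss in plain Python. It is an assumption about the family of objective SVAC belongs to, not the paper's formulation; the function names and the `temperature` default of 0.07 are illustrative.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(video_embs, audio_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where video_embs[i] pairs with audio_embs[i]."""
    if len(video_embs) != len(audio_embs) or not video_embs:
        raise ValueError("need a non-empty batch of paired embeddings")
    v = [_normalize(e) for e in video_embs]
    a = [_normalize(e) for e in audio_embs]
    # logits[i][j]: scaled cosine similarity between video i and audio j
    logits = [[sum(p * q for p, q in zip(vi, aj)) / temperature for aj in a] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # audio-to-video direction
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))
```

With this loss, a batch whose video and audio embeddings are correctly paired scores much lower than the same batch with the audio clips shuffled, which is the signal that drives the two encoders toward a shared space.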

Key Points

SwanSphere uses a causal autoregressive diffusion transformer for streaming spatial audio generation with low latency.
Spatial Video-Audio Contrastive (SVAC) learning aligns video encoders with acoustic domains for better spatial perception.
Multi-objective online direct preference optimization (ODPO) and an automated caption pipeline enhance robustness and data availability.
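
For readers unfamiliar with the preference-optimization point above, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not SwanSphere's multi-objective online variant, whose details this summary does not give; the function name, argument names, and the `beta` default are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair of samples.

    Each argument is a log-probability of a sample under either the policy
    being trained or a frozen reference model. The loss is
    -log(sigmoid(beta * margin)), where margin measures how much more the
    policy favors the preferred sample than the reference model does.
    """
    margin = beta * (
        (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference agree, the loss is log 2; it falls as the policy assigns relatively more probability to the preferred sample. In an online setting the preference pairs would be drawn from the current model during training rather than from a fixed dataset.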

Why It Matters

Enables real-time, high-quality spatial audio from video, critical for VR/AR, live streaming, and immersive media applications.
