Research & Papers

PanoWorld generates geometry-consistent 360° video from a single image

A new AI model creates panoramic video with accurate depth and motion...

Deep Dive

A team from Northeastern University (led by Sarah Ostadabbas) has released PanoWorld, a panoramic video world model that generates 360° video from a single image and a text caption. Its key move is to reframe the task: instead of treating panoramic video generation as pure visual synthesis, it treats it as a geometry- and dynamics-consistent latent state modeling problem. Existing methods produce visually plausible videos, but they often exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. PanoWorld addresses this with two lightweight regularizers: a depth consistency loss that aligns generated depth with pseudo ground-truth panoramic depth, and a trajectory consistency loss that supervises the 3D world-frame positions of tracked points over time. The model also applies spherical-geometry-aware conditioning and positional encoding to better handle the 360° domain.
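
To make the two regularizers concrete, here is a minimal NumPy sketch of what a depth consistency term and a trajectory consistency term could look like. It is not PanoWorld's implementation: the function names, the L1 and Euclidean distances, the visibility mask, and the cosine-latitude weighting (which compensates for equirectangular frames oversampling the poles) are all illustrative assumptions, since the summary does not specify the exact form of the losses.

```python
import numpy as np


def latitude_weights(height, width):
    """Per-pixel solid-angle weights for an equirectangular frame (assumed layout).

    Rows are latitude, top row near +pi/2 and bottom row near -pi/2.
    The weights sum to 1 over the whole frame.
    """
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    w = np.cos(lat)[:, None] * np.ones((1, width))
    return w / w.sum()


def depth_consistency_loss(pred_depth, pseudo_gt_depth):
    """Area-weighted L1 gap between generated and pseudo ground-truth depth.

    Both inputs have shape (T, H, W). The L1 distance is an assumption.
    """
    _, h, w = pred_depth.shape
    weights = latitude_weights(h, w)
    per_frame = (weights * np.abs(pred_depth - pseudo_gt_depth)).sum(axis=(1, 2))
    return per_frame.mean()


def trajectory_consistency_loss(pred_xyz, target_xyz, visible):
    """Mean Euclidean error of tracked points in the 3D world frame.

    pred_xyz and target_xyz have shape (T, N, 3); visible is a (T, N) boolean
    mask (hypothetical) that skips points with no valid track at a given time.
    """
    err = np.linalg.norm(pred_xyz - target_xyz, axis=-1)
    n_visible = visible.sum()
    return (err * visible).sum() / max(n_visible, 1)
```

In a training setup, terms like these would typically be added to the main generation objective with small weights, which is one reading of "lightweight regularizers"; the weighting scheme itself is not described in this summary.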

The team also introduces PanoGeo, a unified dataset combining real and synthetic sources with consistent depth, trajectory, and prompt annotations. Experiments show that PanoWorld significantly improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism. The work positions panoramic video generation as a geometric modeling problem, one that is essential for holistic spatial understanding in embodied AI: robots and autonomous systems that need to navigate and interact with full 360° environments. Code is available on GitHub.

Key Points
  • Generates 360° video from a single image + caption with explicit depth and trajectory consistency losses
  • Introduces PanoGeo dataset with unified depth, trajectory, and prompt annotations across real/synthetic sources
  • Achieves better geometric consistency than prior panoramic generation methods while keeping visual realism competitive

Why It Matters

Enables more reliable spatial understanding for embodied AI (robots, autonomous systems) using consistent 360° video.