Image & Video

O3N: Omnidirectional Open-Vocabulary Occupancy Prediction

New vision model uses Polar-spiral Mamba to create 360° 3D scene understanding from any viewpoint.

Deep Dive

A research team led by Mengfei Duan, with six co-authors, has introduced O3N, a framework for omnidirectional open-vocabulary occupancy prediction. Unlike existing 3D occupancy methods, which are constrained to fixed viewpoints and predefined category sets, O3N enables comprehensive 3D scene understanding from any viewpoint using purely visual inputs. The system's core innovation is its Polar-spiral Mamba (PsM) module, which embeds omnidirectional voxels in a polar-spiral topology, enabling continuous spatial representation and long-range context modeling across full 360° environments.
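
To make the serialization idea concrete, here is a minimal sketch of one way to flatten a 360° polar voxel grid into a single token sequence for a Mamba-style scan. The grid shape, the spiral parameterization, and the function name are illustrative assumptions, not the paper's actual PsM implementation.

```python
import numpy as np


def polar_spiral_order(num_radii: int, num_angles: int) -> np.ndarray:
    """Return a flat scan order over a (radius, angle) polar voxel grid.

    The order winds outward like a discrete spiral: the angle advances one
    bin per step, the radius increments after each full 360-degree
    revolution, and each ring's starting angle is rotated so the seam
    itself spirals rather than cutting the scene along one fixed direction.
    Consecutive tokens therefore stay spatial neighbors across the sweep.
    """
    rings = []
    for r in range(num_radii):
        # Hypothetical per-ring offset; the paper defines its own topology.
        start = (r * num_angles // num_radii) % num_angles
        angles = (start + np.arange(num_angles)) % num_angles
        rings.append(r * num_angles + angles)
    return np.concatenate(rings)


if __name__ == "__main__":
    num_radii, num_angles, channels = 8, 16, 32
    voxel_feats = np.random.randn(num_radii * num_angles, channels)

    order = polar_spiral_order(num_radii, num_angles)
    # Sanity check: the spiral visits every voxel exactly once.
    assert sorted(order.tolist()) == list(range(num_radii * num_angles))

    # 1D token sequence, ready for a Mamba-style selective state-space scan.
    sequence = voxel_feats[order]
    print(sequence.shape)  # (128, 32)
```

Because consecutive tokens remain spatial neighbors and the sequence wraps the scene without a hard seam, a linear-time state-space model can propagate context around the full 360° view, which is what makes this serialization attractive over a row-major flattening of a Cartesian grid.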

O3N adds two further modules. Occupancy Cost Aggregation (OCA) provides a principled mechanism for unifying geometric and semantic supervision within voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics into a consistent "pixel-voxel-text" representation triad. Together, these let the model understand and label objects in 3D space from natural language descriptions rather than a fixed set of predefined categories.
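
A minimal sketch of what open-vocabulary labeling looks like once that triad exists: each voxel embedding is matched against text embeddings of arbitrary prompts by cosine similarity. This assumes the voxel features already live in a CLIP-style joint space; producing that alignment without gradients is the role NMA plays in O3N, and the names below are hypothetical.

```python
import numpy as np


def open_vocab_labels(voxel_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Label each voxel with the index of the most cosine-similar prompt.

    Assumes voxel embeddings share a joint space with the text encoder's
    outputs (the alignment NMA's gradient-free pathway is meant to provide).
    """
    v = voxel_emb / np.linalg.norm(voxel_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return (v @ t.T).argmax(axis=-1)  # shape: (num_voxels,)


if __name__ == "__main__":
    # Toy stand-ins: real embeddings would come from a CLIP-style text
    # encoder run over arbitrary prompts, not a fixed label list.
    prompts = ["a wooden chair", "a fire extinguisher", "floor"]
    rng = np.random.default_rng(0)
    text_emb = rng.standard_normal((len(prompts), 512))
    voxel_emb = rng.standard_normal((1000, 512))

    labels = open_vocab_labels(voxel_emb, text_emb)
    print([prompts[i] for i in labels[:5]])
```

Because the label set is just a list of prompts, it can be extended at inference time without retraining, which is the practical payoff of the open-vocabulary design.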

Extensive testing shows O3N achieving state-of-the-art performance on established benchmarks including QuadOcc and Human360Occ, along with strong cross-scene generalization and semantic scalability. Its open-vocabulary capability, labeling objects from text descriptions rather than a fixed category list, makes it particularly valuable for real-world applications where environments contain unexpected objects. The researchers say they will release the source code publicly, which could accelerate development in embodied AI and autonomous systems.

Key Points
  • Uses Polar-spiral Mamba (PsM) module for 360° spatial representation via polar-spiral topology
  • Achieves state-of-the-art performance on QuadOcc and Human360Occ benchmarks with strong cross-scene generalization
  • Enables open-vocabulary 3D understanding via Natural Modality Alignment (NMA), which builds a "pixel-voxel-text" representation triad

Why It Matters

Enables robots and autonomous agents to understand complex 3D environments and label what they see in natural language, a crucial step toward real-world deployment.