Research & Papers

Modeling Cross-vision Synergy for Unified Large Vision Model

New architecture uses Mixture-of-Experts and dynamic routing to unify image, video, and 3D reasoning.

Deep Dive

A research team from the National University of Singapore (NUS) and Sea AI Lab has introduced PolyV, a novel unified Large Vision Model (LVM) that moves beyond simple functional integration to achieve 'cross-vision synergy.' The core innovation addresses a key limitation in current unified LVMs: while they can process images, videos, and 3D data, they often fail to leverage the complementary strengths and prior knowledge inherent to each modality. PolyV's architecture is designed to enable these modalities to interact and refine each other's understanding, aiming for a more holistic and intelligent visual reasoning system.

Technically, PolyV employs a sparse Mixture-of-Experts (MoE) model coordinated by a dynamic modality router. This allows specialized experts to handle modality-specific information while facilitating bidirectional interaction. Its training combines modality-specific pretraining with a novel 'coarse-to-fine synergy tuning' phase that uses knowledge distillation together with object- and relation-level alignment. In extensive testing on 10 benchmarks spanning image, video, and 3D understanding—including tasks requiring specific spatial or temporal priors—PolyV demonstrated a consistent performance lead, achieving an average improvement of over 10% compared to its backbone model. This establishes a new framework for building LVMs where different visual modalities can truly work together, not just coexist.
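To make the routing idea concrete, here is a minimal sketch of a sparse MoE layer whose gate conditions on both the token and a modality tag, so that image, video, and 3D tokens can be steered to different (but overlapping) expert subsets. All names, dimensions, and the linear-expert design are illustrative assumptions, not PolyV's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM, N_EXPERTS, TOP_K = 16, 4, 2                 # hypothetical sizes
MODALITIES = {"image": 0, "video": 1, "3d": 2}

# Each "expert" is just a linear map here; the router sees the token
# concatenated with a one-hot modality embedding (an assumed design).
experts = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(N_EXPERTS)]
router_w = rng.standard_normal((DIM + len(MODALITIES), N_EXPERTS)) * 0.1
modality_emb = np.eye(len(MODALITIES))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(token, modality):
    """Route one token to its top-k experts and mix their outputs."""
    gate_in = np.concatenate([token, modality_emb[MODALITIES[modality]]])
    gates = softmax(gate_in @ router_w)
    top = np.argsort(gates)[-TOP_K:]              # sparse: only top-k experts run
    weights = gates[top] / gates[top].sum()       # renormalize over chosen experts
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

out = moe_forward(rng.standard_normal(DIM), "video")
```

Because only `TOP_K` of the `N_EXPERTS` experts execute per token, compute stays roughly constant as experts are added, while the modality-conditioned gate lets experts specialize without hard-partitioning them by modality.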

Key Points
  • Uses a sparse Mixture-of-Experts (MoE) architecture with a dynamic modality router for specialized, interactive processing.
  • Achieved over 10% average performance improvement on 10 benchmarks across image, video, and 3D tasks.
  • Implements a synergy-aware training paradigm with knowledge distillation and multi-level alignment for cross-modal refinement.
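The synergy-aware training objective listed above can be sketched as a weighted sum of a distillation term and alignment terms at the object and relation levels. The loss weights, the use of KL divergence for distillation, and cosine distance for alignment are all assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl_div(p, q, eps=1e-9):
    """KL(p || q) between discrete distributions (the distillation term)."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def cosine_mismatch(a, b, eps=1e-9):
    """1 - cosine similarity: penalty for misaligned paired features."""
    return float(1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def synergy_loss(student_logits, teacher_logits,
                 obj_a, obj_b, rel_a, rel_b,
                 w_kd=1.0, w_obj=0.5, w_rel=0.5):
    """Distillation plus object-/relation-level alignment.

    The weights w_* are hypothetical placeholders, not published values.
    """
    kd = kl_div(softmax(teacher_logits), softmax(student_logits))
    return (w_kd * kd
            + w_obj * cosine_mismatch(obj_a, obj_b)   # object-level alignment
            + w_rel * cosine_mismatch(rel_a, rel_b))  # relation-level alignment
```

In a coarse-to-fine schedule, one would plausibly anneal the weights so early training emphasizes the global distillation term and later stages emphasize the finer-grained alignment terms.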

Why It Matters

Enables AI systems to perform more holistic, human-like reasoning by combining insights from images, video, and 3D data.