Research & Papers

GGPT: Geometry Grounded Point Transformer

A new transformer model resolves the geometric inconsistencies that plague 3D reconstruction from just a few photos.

Deep Dive

A research team including Yutong Chen, Yiming Wang, and Sergey Prokudin has introduced GGPT (Geometry Grounded Point Transformer), a novel AI framework that addresses a critical limitation in current 3D reconstruction technology. While existing feed-forward networks can create 3D point clouds directly from RGB images, they often produce geometrically inconsistent results with limited fine-grained accuracy, especially when working with only a few input views. GGPT solves this by augmenting the reconstruction process with reliable sparse geometric guidance, creating a principled bridge between geometric priors and dense neural predictions.

The system operates in two key stages. First, an improved Structure-from-Motion pipeline uses dense feature matching and lightweight geometric optimization to efficiently estimate accurate camera poses and generate a partial 3D point cloud from the sparse input views. This sparse geometric data then serves as explicit supervision for the second stage: a geometry-guided 3D point transformer that refines a dense point map prediction. The transformer uses an optimized guidance encoding to ensure the final output respects multi-view consistency, recovering fine structures and intelligently filling gaps in textureless areas that typically challenge AI models.
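The core intuition of that second stage can be illustrated with a toy sketch: treat the sparse SfM points as trusted anchors and pull nearby dense predictions toward them. The function and variable names below are illustrative assumptions, not GGPT's actual code, and the fixed blending rule stands in for what the paper's learned transformer does.

```python
def refine_with_sparse_guidance(dense_points, sparse_anchors, weight=0.5):
    """Pull each dense 3D point toward its nearest reliable sparse anchor.

    dense_points:   (x, y, z) tuples predicted by a feed-forward network
    sparse_anchors: (x, y, z) tuples from the SfM stage (trusted geometry)
    weight:         how strongly the sparse geometry constrains the output
    """
    refined = []
    for p in dense_points:
        # Find the nearest sparse anchor (brute force, for clarity).
        nearest = min(
            sparse_anchors,
            key=lambda a: sum((pi - ai) ** 2 for pi, ai in zip(p, a)),
        )
        # Blend the prediction toward the anchor; in the actual method a
        # learned, geometry-guided transformer replaces this fixed rule.
        refined.append(
            tuple(pi + weight * (ai - pi) for pi, ai in zip(p, nearest))
        )
    return refined


dense = [(0.0, 0.0, 1.2), (1.0, 0.0, 0.7)]   # noisy network output
anchors = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)] # sparse SfM geometry
print(refine_with_sparse_guidance(dense, anchors))
```

A fixed nearest-anchor blend like this cannot fill gaps in textureless regions; that is precisely why the paper uses a transformer with a guidance encoding instead of a hand-written rule.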

Extensive experiments demonstrate GGPT's superior performance. Trained solely on the ScanNet++ dataset using predictions from another model called VGGT, GGPT shows remarkable generalization capability. It substantially outperforms current state-of-the-art feed-forward 3D reconstruction models across different architectures and datasets, in both familiar (in-domain) and novel (out-of-domain) settings. The work, accepted for CVPR 2026, provides a new blueprint for integrating classical computer vision geometry with modern transformer-based deep learning, moving beyond purely data-driven approaches.

Key Points
  • Combines improved SfM pipeline with geometry-guided transformer for explicit multi-view constraints
  • Trained on ScanNet++ with VGGT data, generalizes across architectures and datasets
  • Substantially outperforms state-of-the-art models in accuracy and geometric consistency
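Reconstruction accuracy claims like these are conventionally quantified with point-cloud metrics such as Chamfer distance. The following is a minimal illustrative sketch of that standard metric, not the paper's evaluation code:

```python
def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two 3D point clouds.

    For each point in one cloud, take the squared distance to its nearest
    neighbour in the other cloud; average these, in both directions.
    """
    def one_way(src, dst):
        return sum(
            min(sum((s - d) ** 2 for s, d in zip(p, q)) for q in dst)
            for p in src
        ) / len(src)

    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)


# Identical clouds score zero; a cloud shifted 0.1 along z does not.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]
print(chamfer_distance(a, a))  # 0.0
print(chamfer_distance(a, b))
```

The brute-force nearest-neighbour search is O(n·m); real evaluation pipelines use KD-trees or GPU batching for the same computation.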

Why It Matters

Enables more reliable 3D scanning from fewer photos for robotics, AR/VR, and digital twins.