Geo²: Geometry-Guided Cross-view Geo-Localization and Image Synthesis
A new AI framework bridges ground and aerial views with a shared 3D latent space, achieving state-of-the-art results.
A research team led by Yancheng Zhang from the University of Central Florida has introduced Geo², a novel AI framework that tackles two critical geospatial tasks simultaneously: pinpointing a ground photo's location on a map (Cross-View Geo-Localization) and synthesizing the corresponding aerial or street-level view from the other perspective (Cross-View Image Synthesis). The core innovation lies in its use of pre-trained Geometric Foundation Models (GFMs), which are adept at extracting 3D structural information from 2D images. Geo² repurposes these powerful geometric priors to bridge the significant visual gap between ground-level and satellite imagery.
The framework consists of two main components. First, GeoMap creates a shared 3D-aware latent space where features from both ground and aerial images are embedded. This alignment drastically reduces the viewpoint discrepancy, making it far easier for the model to match a street photo to its correct map tile. Second, this unified latent space naturally enables image synthesis. The team developed GeoFlow, a flow-matching generative model conditioned on the geometry-aware embeddings, which can realistically generate an aerial view from a ground photo, or vice versa. A consistency loss ensures the synthesized images are coherent in both directions.
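To make the GeoFlow idea concrete, here is a minimal sketch of a conditional flow-matching training objective, the standard recipe that GeoFlow builds on: sample an interpolation time, blend noise with the target image representation, and regress a velocity field toward the straight-line direction from noise to data, conditioned on the other view's embedding. Everything here is a toy stand-in, not the paper's implementation: `velocity_model` is a placeholder linear map, and random vectors substitute for the actual GeoMap embeddings and aerial features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in Geo^2 these would be GeoMap embeddings of
# paired ground/aerial images; here random vectors play both roles.
DIM = 8
ground_emb = rng.normal(size=(16, DIM))   # conditioning signal (ground view)
aerial_feat = rng.normal(size=(16, DIM))  # synthesis target (aerial view)

def velocity_model(x_t, t, cond, W):
    """Toy linear velocity field v(x_t, t | cond); a real model is a deep net."""
    inp = np.concatenate([x_t, cond, np.full((len(x_t), 1), t)], axis=1)
    return inp @ W

def flow_matching_loss(W):
    """Rectified flow-matching objective: sample t ~ U[0,1], interpolate
    x_t = (1-t)*x0 + t*x1, and regress the predicted velocity toward the
    straight-line target x1 - x0."""
    x0 = rng.normal(size=aerial_feat.shape)  # noise sample
    x1 = aerial_feat                         # data sample
    t = rng.uniform()
    x_t = (1 - t) * x0 + t * x1
    target = x1 - x0
    pred = velocity_model(x_t, t, ground_emb, W)
    return np.mean((pred - target) ** 2)

W = np.zeros((2 * DIM + 1, DIM))  # untrained parameters
loss = flow_matching_loss(W)
print(loss)  # positive scalar; training would minimize this over W
```

The bidirectional consistency loss described above would add a second term of the same shape with the roles of ground and aerial swapped, encouraging the two synthesis directions to agree.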
Extensive testing on major benchmarks—CVUSA, CVACT, and VIGOR—demonstrates Geo²'s effectiveness. The model achieves state-of-the-art performance, outperforming previous methods in both localization accuracy and the quality of synthesized cross-view images. This work highlights the untapped potential of geometric priors for complex real-world vision tasks that require a fundamental understanding of 3D space from disparate 2D viewpoints.
- Unifies two tasks: Geo² jointly performs cross-view geo-localization (CVGL) and bidirectional cross-view image synthesis (CVIS) in a single framework.
- Leverages 3D geometry: It uses Geometric Foundation Models (GFMs) like VGGT to embed features into a shared 3D-aware latent space (GeoMap), reducing viewpoint gaps.
- Sets new benchmarks: Achieves state-of-the-art results on CVUSA, CVACT, and VIGOR datasets for both locating photos and generating corresponding aerial/ground views.
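The localization side of the framework reduces to nearest-neighbor retrieval once both views live in the shared 3D-aware latent space: embed the ground query, then rank the database of aerial tiles by similarity. The sketch below illustrates that retrieval step with cosine similarity over random unit vectors standing in for GeoMap embeddings; the tile index, dimensions, and noise level are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(x):
    """Project embeddings onto the unit sphere so dot product = cosine."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical shared-space embeddings: 100 reference aerial tiles.
aerial_db = l2_normalize(rng.normal(size=(100, 32)))

# A ground-view query whose true location is tile 42, plus a little
# noise to mimic the residual viewpoint gap after alignment.
query = l2_normalize(aerial_db[42] + 0.05 * rng.normal(size=32))

# CVGL as retrieval: rank all tiles by cosine similarity to the query.
scores = aerial_db @ query
ranked = np.argsort(-scores)
print(ranked[0])  # prints 42: the correct tile is the top match
```

The better GeoMap collapses the viewpoint discrepancy, the smaller the effective noise on the query embedding, and the higher the top-1 retrieval accuracy reported on CVUSA, CVACT, and VIGOR.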
Why It Matters
This technology could significantly improve mapping services, autonomous navigation, and augmented reality by seamlessly linking street-level and satellite perspectives.