Pinpoint AI geolocates photos worldwide with state-of-the-art accuracy on three benchmarks
Combines Flickr and street-view images to geolocate photos without multimodal LLMs, with faster inference as a result.
Pinpoint tackles the longstanding challenge of image geolocation—estimating where a photo was taken from its visual content. Prior work treated internet photos and street-view imagery as separate tasks, but Pinpoint combines both in a coarse-to-fine pipeline. It first trains a contrastive image-GPS embedder on Flickr and street-view data to learn a shared embedding space for retrieving candidate locations. Then, an attention-based reranker refines those candidates by fusing visual and GPS features with cross-source evidence from nearby locations. This hybrid approach leverages the complementary strengths of internet photos (which better match user-captured queries) and street-view imagery (which provides denser geographic coverage).
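The first, coarse stage can be illustrated with a minimal sketch. The summary above does not name the training loss or the search method, so this assumes an InfoNCE-style contrastive loss and brute-force cosine-similarity retrieval. The names `info_nce` and `retrieve`, the temperature value, and the toy three-dimensional embeddings are hypothetical stand-ins, not Pinpoint's actual code.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(image_embs, gps_embs, temperature=0.1):
    """Image-to-GPS contrastive loss; row i of each list is a matching pair.

    Assumed loss form: the source only says the embedder is contrastive.
    """
    imgs = [normalize(v) for v in image_embs]
    locs = [normalize(v) for v in gps_embs]
    loss = 0.0
    for i, img in enumerate(imgs):
        logits = [dot(img, loc) / temperature for loc in locs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(imgs)

def retrieve(query_emb, gallery, k=2):
    """Return the k (similarity, coords) pairs closest to the query embedding."""
    q = normalize(query_emb)
    scored = [(dot(q, normalize(emb)), coords) for emb, coords in gallery]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy gallery: (embedding, (latitude, longitude)) pairs, invented for illustration.
gallery = [
    ([1.0, 0.0, 0.0], (48.8584, 2.2945)),
    ([0.0, 1.0, 0.0], (40.6892, -74.0445)),
    ([0.9, 0.1, 0.0], (48.8606, 2.3376)),
]
candidates = retrieve([1.0, 0.05, 0.0], gallery, k=2)

# Matching pairs should score a lower loss than mismatched pairs.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the two gallery entries whose embeddings point in nearly the same direction as the query are returned as candidates, and the aligned pairs yield a lower loss than the shuffled ones. A real system would likely replace the linear scan with an approximate nearest-neighbor index.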
Crucially, Pinpoint does not rely on multimodal large language models (MLLMs), which are slow and hard to reproduce. Instead, it uses a lightweight retrieve-and-rerank architecture that delivers faster inference and fully deterministic results. The model reports state-of-the-art performance on three standard benchmarks: IM2GPS3k and YFCC4k for internet photos, and OSV-5M for street-view imagery. This makes Pinpoint a practical candidate for applications such as forensic image analysis, travel verification, and autonomous navigation, where locating an unknown photo matters.
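The rerank half of a retrieve-and-rerank pipeline can be sketched in the same spirit. This is a deliberately simplified illustration: it assumes single-head scaled dot-product attention with no learned weights, fuses visual and GPS features by plain concatenation, and leaves out the cross-source evidence from nearby locations. The `rerank` function, the feature sizes, and all values are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rerank(query_feat, candidates):
    """Reorder candidates by the attention weight the query assigns to each.

    candidates: list of dicts with 'coords', 'visual', and 'gps' entries.
    The query must have the same length as visual + gps concatenated.
    """
    keys = [c["visual"] + c["gps"] for c in candidates]  # fuse by concatenation
    scale = math.sqrt(len(query_feat))
    scores = [sum(q * k for q, k in zip(query_feat, key)) / scale for key in keys]
    weights = softmax(scores)
    order = sorted(range(len(candidates)), key=lambda i: weights[i], reverse=True)
    return [(candidates[i]["coords"], weights[i]) for i in order]

# Two retrieved candidates with invented features.
retrieved = [
    {"coords": (48.8584, 2.2945), "visual": [0.2, 0.1], "gps": [0.5, 0.5]},
    {"coords": (48.8606, 2.3376), "visual": [0.9, 0.3], "gps": [0.4, 0.6]},
]
ranked = rerank([1.0, 0.5, 0.2, 0.2], retrieved)
```

Because the function is pure arithmetic with no sampling, repeated calls on the same inputs return identical rankings, which is the kind of determinism the paragraph above contrasts with MLLM-based approaches.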
- Combines Flickr and street-view imagery via a contrastive embedding trained on both sources.
- Uses an attention-based reranker that fuses visual and GPS features with cross-source evidence from nearby locations.
- Achieves SOTA on IM2GPS3k, YFCC4k, and OSV-5M benchmarks without any multimodal LLM.
Why It Matters
Faster, reproducible geolocation without LLMs opens up real-world use cases from forensics to autonomous navigation.