DINO Soars: DINOv3 for Open-Vocabulary Semantic Segmentation of Remote Sensing Imagery
DINOv3 backbone removes the need for expensive labeled RS data while outperforming OVSS methods fine-tuned on RS imagery
A new research paper from Ryan Faulkenberry and Saurabh Prasad introduces CAFe-DINO (Cost Aggregation + Feature Upsampling with DINO), an open-vocabulary semantic segmentation (OVSS) model tailored for remote sensing (RS) imagery. The key innovation is leveraging the DINOv3 vision transformer backbone, which already surpasses state-of-the-art RS foundation models on the GEO-bench segmentation benchmark without any pretraining on RS data. CAFe-DINO extends this capability with a cost aggregation module and a training-free upsampling mechanism for text-image similarity scores, enabling it to segment arbitrary categories described in natural language. The model is fine-tuned only on a carefully curated RS-targeted subset of COCO-Stuff, avoiding the need for expensive, densely labeled RS datasets. Despite this constraint, CAFe-DINO achieves state-of-the-art performance on key RS segmentation datasets, outperforming existing OVSS methods that were fine-tuned on RS data.
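The core open-vocabulary step described above (scoring each image patch against text embeddings, then upsampling the score maps without any learned parameters) can be illustrated with a minimal sketch. This is not the authors' implementation: the learned cost aggregation module is omitted, bilinear interpolation is used here only as an assumed stand-in for the paper's training-free upsampling, and the toy vectors replace real DINOv3 patch features and text-encoder outputs. The function names `cosine_similarity_map` and `bilinear_upsample` are hypothetical.

```python
import numpy as np

def cosine_similarity_map(patch_feats, text_embeds):
    """patch_feats: (H, W, D), text_embeds: (C, D) -> (C, H, W) cosine scores."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    return np.einsum("hwd,cd->chw", p, t)

def bilinear_upsample(score, out_h, out_w):
    """Parameter-free bilinear resize of a (C, H, W) map to (C, out_h, out_w)."""
    _, h, w = score.shape
    # Sample positions in the source grid (corner-aligned).
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]  # vertical interpolation weights
    wx = (xs - x0)[None, None, :]  # horizontal interpolation weights
    top = score[:, y0][:, :, x0] * (1 - wx) + score[:, y0][:, :, x1] * wx
    bot = score[:, y1][:, :, x0] * (1 - wx) + score[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

# Toy stand-ins (assumptions): two "text embeddings" and a 2x2 patch grid
# whose left column matches class 0 and right column matches class 1.
text_embeds = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0]])
patch_feats = np.array([[[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]],
                        [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])

sim = cosine_similarity_map(patch_feats, text_embeds)  # (2, 2, 2)
dense = bilinear_upsample(sim, 4, 4)                    # (2, 4, 4)
labels = dense.argmax(axis=0)                           # per-pixel class index
print(labels)
```

Because the class list is just a set of text embeddings, querying a different category in this sketch only means swapping the rows of `text_embeds`; nothing is retrained.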
This work addresses a critical bottleneck in remote sensing: the scarcity of labeled training data. Traditional supervised methods require thousands of manually annotated images, which are costly and time-consuming to produce. By leveraging DINOv3's robust latent representations, CAFe-DINO demonstrates that strong foundation models can generalize to the RS domain with minimal domain-specific tuning. The open-vocabulary nature of the model means users can query for any land cover class or object type using natural language, without retraining. This makes CAFe-DINO highly practical for real-world applications such as precision agriculture, urban expansion monitoring, disaster response, and environmental conservation. The code and data are publicly available, enabling rapid adoption and further research in cost-effective RS analysis.
- DINOv3 backbone surpasses state-of-the-art RS foundation models on the GEO-bench segmentation benchmark without any remote sensing pretraining
- Open-vocabulary segmentation via cost aggregation and training-free upsampling of text-image similarity scores
- Fine-tuned only on an RS-targeted subset of COCO-Stuff, yet outperforms OVSS methods fine-tuned on RS data
Why It Matters
Enables accurate, flexible land cover mapping without expensive, densely labeled RS datasets, cutting costs for geospatial analytics.