Research & Papers

Sentinel2Cap: A Human-Annotated Benchmark Dataset for Multimodal Remote Sensing Image Captioning

First human-annotated captioning dataset for SAR and multispectral satellite imagery at 10m/20m resolution.

Deep Dive

Researchers Lucrezia Tosato, Gianluca Lombardi, and Ronny Hansch have released Sentinel2Cap, a human-annotated multimodal captioning dataset designed specifically for satellite imagery. Unlike existing datasets that focus on natural images or high-resolution optical remote sensing, Sentinel2Cap covers Sentinel-1 SAR and Sentinel-2 multispectral image patches at 10m and 20m spatial resolution. Each patch contains diverse land cover compositions, and captions were manually created and carefully validated for both semantic accuracy and linguistic quality. The dataset is publicly available to accelerate research in cross-modal scene understanding.

To benchmark the dataset, the team performed zero-shot captioning using the Qwen3-VL-8B-Instruct model across three image modalities: RGB composites, multispectral bands, and SAR pseudo-RGB representations. Results indicate that RGB images achieve the highest captioning performance, while SAR images remain significantly more challenging for current vision-language models. Notably, providing modality-specific contextual prompts consistently improved performance across all metrics. These findings underscore both the difficulty of multimodal remote sensing captioning and the critical role of human-annotated datasets in advancing automated satellite image interpretation.
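The paper does not publish its exact preprocessing, but the two composite types it evaluates are typically built as follows: a true-color RGB from Sentinel-2's red/green/blue bands, and a SAR pseudo-RGB that maps the two Sentinel-1 polarizations plus their ratio onto color channels. A minimal sketch, assuming standard band names (B04/B03/B02, VV/VH in dB) and a common percentile-stretch normalization; none of these choices are confirmed by the paper:

```python
import numpy as np

def percentile_stretch(band, lo=2, hi=98):
    """Clip a band to its 2nd-98th percentiles and rescale to [0, 1]."""
    p_lo, p_hi = np.percentile(band, [lo, hi])
    return np.clip((band - p_lo) / (p_hi - p_lo + 1e-9), 0.0, 1.0)

def sentinel2_rgb(b04, b03, b02):
    """True-color composite: Sentinel-2 red (B04), green (B03), blue (B02)."""
    return np.dstack([percentile_stretch(b) for b in (b04, b03, b02)])

def sar_pseudo_rgb(vv, vh):
    """A common SAR false-color convention: R=VV, G=VH, B=VV/VH ratio.

    With backscatter in dB, the ratio becomes a simple difference.
    """
    ratio = vv - vh  # dB-domain VV/VH ratio
    return np.dstack([percentile_stretch(c) for c in (vv, vh, ratio)])
```

Both helpers return an H×W×3 float array in [0, 1], ready to be saved as an 8-bit image and fed to a vision-language model.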

Key Points
  • Sentinel2Cap includes human-annotated captions for Sentinel-1 SAR and Sentinel-2 multispectral patches at 10m/20m resolution.
  • Zero-shot evaluation with Qwen3-VL-8B-Instruct revealed RGB images outperform SAR representations by a significant margin.
  • Modality-specific contextual prompts boost captioning accuracy across all image types, highlighting the need for domain-aware AI.
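The modality-specific contextual prompting noted above could look like the sketch below. The prompt wording is hypothetical (the summary does not reproduce the authors' prompts); the point is that each modality gets context a general-purpose model lacks, such as how to read SAR backscatter:

```python
# Illustrative per-modality prompts; the exact wording used in the paper
# is an assumption, not reproduced from the source.
PROMPTS = {
    "rgb": (
        "Describe this optical satellite image, focusing on land cover "
        "such as forest, cropland, water, and built-up areas."
    ),
    "multispectral": (
        "This is a Sentinel-2 multispectral composite at 10m resolution. "
        "Describe the land cover composition of the scene."
    ),
    "sar": (
        "This is a Sentinel-1 SAR pseudo-RGB image: bright areas indicate "
        "strong radar backscatter (e.g. buildings), dark areas indicate "
        "smooth surfaces such as water. Describe the scene."
    ),
}

def build_prompt(modality: str) -> str:
    """Return the contextual caption prompt for a given image modality."""
    return PROMPTS[modality]
```

The finding that such prompts help across all metrics suggests that, short of fine-tuning, telling the model what physical quantity the pixels encode is a cheap way to close part of the modality gap.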

Why It Matters

This dataset bridges a key gap in AI-driven satellite analysis, enabling more accurate automated monitoring of land use, disasters, and environmental change.