Cross-Scale MAE: A Tale of Multi-Scale Exploitation in Remote Sensing
New self-supervised model trains on a single GPU, outperforming standard MAE for satellite imagery analysis.
A research team led by Maofeng Tang from the University of Tennessee, Knoxville, has introduced Cross-Scale MAE, a self-supervised learning model designed for the complexities of remote sensing imagery. Building on the popular Masked Auto-Encoder (MAE) framework, the model addresses challenges unique to the domain, such as vast geographic coverage, sensor hardware constraints, and images captured at inconsistent, misaligned scales. During pre-training, it employs scale augmentation and enforces cross-scale consistency through a combination of contrastive and generative losses. This ensures the learned visual representations are consistent and meaningful across different resolutions, making them highly adaptable to a range of downstream applications.
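The combined objective described above can be sketched in a few lines. This is an illustrative simplification, not the authors' implementation: a contrastive (InfoNCE-style) term pulls together embeddings of the same scene at two scales, and a generative term penalizes reconstruction error on masked patches; the function names, weights, and the plain MSE reconstruction term are assumptions for the sketch.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.07):
    """Contrastive (InfoNCE-style) loss: matching rows of z_a and z_b
    are positives; every other pairing in the batch is a negative."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # positives sit on the diagonal

def cross_scale_loss(emb_hi, emb_lo, recon, target,
                     w_contrastive=1.0, w_generative=1.0):
    """Combine a contrastive term (embedding consistency across two scales)
    with a generative term (error on reconstructed masked patches)."""
    contrastive = info_nce(emb_hi, emb_lo)
    generative = np.mean((recon - target) ** 2)  # simple MSE stand-in
    return w_contrastive * contrastive + w_generative * generative
```

Intuitively, the contrastive term is minimized when an image's embeddings agree across scales, while the generative term keeps the representation rich enough to reconstruct the masked input.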
A key technical achievement is the model's efficiency. The implementation leverages the xFormers library to accelerate network pre-training, enabling it to run on a single GPU without sacrificing the quality of the learned features. Experimental results on benchmark datasets show that Cross-Scale MAE outperforms the standard MAE and other state-of-the-art remote sensing MAE methods. This advancement provides a more scalable and effective foundation model for the field, reducing the dependency on large, labeled datasets.
The work, detailed in a paper on arXiv, represents a significant step in applying self-supervised learning—a technique where models learn from data without human-provided labels—to the domain of geospatial analysis. By creating a model that better understands the multi-scale nature of satellite and aerial imagery, it paves the way for more accurate and automated tools for environmental monitoring, urban planning, and disaster response. The public availability of the research facilitates further development and application by the broader AI and remote sensing communities.
- Built on the Masked Auto-Encoder (MAE) framework with novel scale augmentation and cross-scale consistency losses.
- Leverages xFormers library for efficient training, achieving state-of-the-art results using only a single GPU.
- Outperforms standard MAE and other specialized models on remote sensing downstream tasks like classification and detection.
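The scale augmentation in the first bullet can be pictured as deriving coarser views of the same scene from one image, so the encoder sees identical content at multiple resolutions. A minimal sketch, using nearest-neighbor downsampling via array slicing as a stand-in for proper resampling (the function name and factors are hypothetical):

```python
import numpy as np

def scale_views(img, factors=(1, 2, 4)):
    """Produce multi-scale views of one image by nearest-neighbor
    downsampling. img: (H, W, C) array; factors: integer strides."""
    return [img[::f, ::f, :] for f in factors]

# Example: a 64x64 RGB image yields 64x64, 32x32, and 16x16 views,
# which a cross-scale model would encode and compare for consistency.
views = scale_views(np.zeros((64, 64, 3)))
```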
Why It Matters
Enables more accurate, efficient AI analysis of satellite imagery for climate science, agriculture, and defense without massive labeled datasets.