Robotics

OVerSeeC: Open-Vocabulary Costmap Generation from Satellite Images and Natural Language

A new framework lets robots plan long-range routes by grounding natural language commands in satellite imagery.

Deep Dive

A research team from the University of Texas at Austin and the Army Research Laboratory has unveiled OVerSeeC, a novel framework that enables autonomous systems to generate navigation costmaps directly from satellite imagery using natural language instructions. The system addresses a critical gap in long-range robotic planning, where mission requirements are fluid and terrain features may be unknown at deployment.

Technically, OVerSeeC employs a three-stage 'Interpret-Locate-Synthesize' pipeline. First, a large language model (LLM) parses the user's natural language prompt to extract specific entities and ranked traversal preferences (e.g., 'avoid residential areas, prefer paved roads'). Second, open-vocabulary grounding and segmentation models such as Grounding DINO and SAM identify and mask these entities within high-resolution satellite imagery. Finally, the LLM synthesizes this information (the user's preferences and the visual masks) into executable Python code that produces a detailed costmap for a path planner. This modular, zero-shot approach lets the system handle novel objects and complex, compositional logic (such as 'and'/'or' statements) that static, ontology-based systems cannot.
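
To make the pipeline concrete, here is a minimal Python sketch of the Interpret-Locate-Synthesize flow with the LLM and segmentation calls stubbed out. All names (parse_instruction, segment_entities, build_costmap) are hypothetical placeholders rather than the authors' released code; only the three-stage shape follows the paper.

    import numpy as np

    # Hypothetical sketch of the Interpret-Locate-Synthesize flow.
    # Function names and logic are illustrative, not the authors' code.

    def parse_instruction(prompt: str) -> dict:
        # Stage 1 (Interpret): in OVerSeeC an LLM extracts entities and
        # ranked preferences from the prompt. Stubbed for the running example.
        return {"avoid": ["residential area"], "prefer": ["paved road"]}

    def segment_entities(image, entities):
        # Stage 2 (Locate): an open-vocabulary grounding/segmentation stack
        # (e.g., Grounding DINO + SAM) would return one binary mask per
        # entity. Stubbed with pseudo-random masks of the tile's shape.
        h, w = image.shape[:2]
        rng = np.random.default_rng(0)
        return {e: rng.random((h, w)) > 0.8 for e in entities}

    def build_costmap(masks, prefs, base_cost=1.0):
        # Stage 3 (Synthesize): in OVerSeeC this rule is Python code emitted
        # by the LLM; an equivalent rule is hand-written here: raise cost on
        # 'avoid' masks, lower it on 'prefer' masks.
        shape = next(iter(masks.values())).shape
        cost = np.full(shape, base_cost)
        for entity in prefs["avoid"]:
            cost[masks[entity]] += 10.0  # strongly penalize avoided terrain
        for entity in prefs["prefer"]:
            cost[masks[entity]] = np.minimum(cost[masks[entity]], 0.2)
        return cost

    # Wire the stages together on a dummy 256x256 satellite tile.
    tile = np.zeros((256, 256, 3), dtype=np.uint8)
    prefs = parse_instruction("avoid residential areas, prefer paved roads")
    masks = segment_entities(tile, prefs["avoid"] + prefs["prefer"])
    costmap = build_costmap(masks, prefs)
    print(costmap.shape, float(costmap.min()), float(costmap.max()))

The key design point is in the third stage: the combination rule itself is generated per instruction rather than fixed in advance, which is what lets arbitrary 'and'/'or' compositions work without a predefined ontology.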

In practical terms, this means a human operator can simply tell a delivery or exploration robot to 'plan a route that avoids muddy fields and stays near tree lines for cover' over a 10-square-mile area, and OVerSeeC will translate that into a usable navigation plan. The team's empirical results show the framework produces routes consistent with human-drawn trajectories across diverse geographic regions, demonstrating robustness to distribution shifts. This research, published on arXiv, represents a significant step toward scalable, mission-adaptive global planning by effectively composing existing foundation models for vision and language.
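
The article does not specify which planner consumes the costmap, and in principle any grid planner will do. As a rough illustration of that last step, the sketch below runs a plain 4-connected Dijkstra search over a toy costmap containing a high-cost 'muddy field' band with a cheap gap at one edge; the recovered route detours through the gap, which is exactly the behavior the generated costmap is meant to induce. This is an assumed stand-in, not the paper's planner.

    import heapq
    import numpy as np

    def plan_route(cost, start, goal):
        # Minimal 4-connected Dijkstra over a costmap grid; illustrative
        # only, and it assumes the goal is reachable from the start.
        h, w = cost.shape
        dist = np.full((h, w), np.inf)
        dist[start] = 0.0
        parent = {}
        pq = [(0.0, start)]
        while pq:
            d, (r, c) = heapq.heappop(pq)
            if (r, c) == goal:
                break
            if d > dist[r, c]:
                continue  # stale queue entry
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < h and 0 <= nc < w:
                    nd = d + cost[nr, nc]  # pay the cost of entering a cell
                    if nd < dist[nr, nc]:
                        dist[nr, nc] = nd
                        parent[(nr, nc)] = (r, c)
                        heapq.heappush(pq, (nd, (nr, nc)))
        path, node = [goal], goal  # walk back from the goal to recover it
        while node != start:
            node = parent[node]
            path.append(node)
        return path[::-1]

    # Toy costmap: a 'muddy field' band leaving a cheap gap at the right.
    cm = np.ones((50, 50))
    cm[20:30, :40] = 25.0
    route = plan_route(cm, (0, 0), (49, 49))
    print(f"route length: {len(route)} cells")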

Key Points
  • Uses a three-stage pipeline: LLM interpretation, open-vocabulary segmentation of satellite imagery, and code synthesis for costmaps.
  • Enables zero-shot handling of novel terrain entities and complex, compositional natural language instructions.
  • Demonstrated robustness across diverse regions, producing routes that align with human-drawn trajectories.

Why It Matters

Enables more flexible, intuitive command of autonomous vehicles for logistics, search & rescue, and exploration over large, unfamiliar areas.