Robotics

SAGE drone system finds objects from language commands, 13.7× faster than FTU

CLIP-powered drone explores unknown indoor spaces a mean 13.7× faster than the prior FTU method

Deep Dive

Researchers Nitin Vegesna and Avideh Zakhor have unveiled SAGE (Semantic-Aware Guided Exploration), a drone system that combines volumetric mapping with open-vocabulary object detection. Building on the FALCON explorer, SAGE integrates CLIP (Contrastive Language-Image Pre-training) through four components: object-centric embedding storage, a temporal cache of recent observations along free-unknown boundaries, object frontiers for high-similarity detections, and a unified semantic-geometric cost function. With this design, semantic cues reprioritize exploration frontiers without sacrificing total coverage. That is a key improvement over prior methods, which either ignore semantics or over-prioritize them and leave large areas unmapped.
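To make the idea of a unified semantic-geometric cost concrete, here is a minimal Python sketch. It is not SAGE's actual cost function: the `Frontier` fields, the linear form of `frontier_cost`, and the weights `w_dist`, `w_gain`, and `w_sem` are all assumptions chosen for illustration. The point it shows is that a semantic term can reorder frontiers while every frontier stays in the queue, so coverage is deferred rather than dropped.

```python
import math
from dataclasses import dataclass


@dataclass
class Frontier:
    name: str
    position: tuple[float, float, float]  # frontier centroid, metres
    info_gain: float   # estimated unknown volume revealed (hypothetical units)
    similarity: float  # best query similarity observed near it, in [0, 1]


def frontier_cost(f, drone_pos, w_dist=1.0, w_gain=0.5, w_sem=4.0):
    """Lower cost = visit sooner. Weights are illustrative, not from SAGE."""
    travel = math.dist(f.position, drone_pos)
    return w_dist * travel - w_gain * f.info_gain - w_sem * f.similarity


def rank_frontiers(frontiers, drone_pos, **weights):
    # Sorting (not filtering) keeps every frontier, so nothing is left unmapped.
    return sorted(frontiers, key=lambda f: frontier_cost(f, drone_pos, **weights))


frontiers = [
    Frontier("hallway", (2.0, 0.0, 1.0), info_gain=3.0, similarity=0.10),
    Frontier("kitchen", (4.0, 0.0, 1.0), info_gain=3.0, similarity=0.90),
]
drone = (0.0, 0.0, 1.0)

semantic_order = rank_frontiers(frontiers, drone)
geometric_order = rank_frontiers(frontiers, drone, w_sem=0.0)

print([f.name for f in semantic_order])
print([f.name for f in geometric_order])
```

With the semantic weight switched off, the nearer hallway frontier comes first. With it on, the farther kitchen frontier is promoted because of its higher similarity score, and the hallway is still scheduled afterwards.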

In Matterport3D-based simulations, SAGE outperformed both FALCON and a semantic-only ablation in object discovery across map-query pairs. Compared with FTU (Finding Things in the Unknown), SAGE completed exploration 9.0 to 25.9 times faster across nine shared pairs, with a mean speedup of 13.7× and substantially higher volumetric throughput. The system was also validated in five real-world flights on a ModalAI Starling 2 quadrotor, which ran sensing and planning onboard and offloaded CLIP inference to a ground station. FALCON alone explored faster and flew shorter trajectories, but SAGE was better at actually finding the requested objects, demonstrating that semantic guidance can be used without catastrophic coverage loss.

Key Points
  • SAGE uses CLIP for open-vocabulary object detection, allowing natural language commands like 'find a fire extinguisher'
  • Achieved 13.7× mean speedup over FTU in object discovery tasks while maintaining 99%+ volumetric coverage
  • Deployed on a ModalAI Starling 2 quadrotor with real-time onboard planning and offboard CLIP inference
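
The open-vocabulary matching behind the first key point can be sketched as a cosine-similarity lookup over stored object embeddings. This is an illustration under stated assumptions, not SAGE's implementation: the function names, the toy 3-D vectors standing in for real CLIP embeddings, and the `threshold` value are all hypothetical. Real CLIP image-text similarities are typically much lower than the toy numbers here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def matching_objects(object_embeddings, query_embedding, threshold=0.8):
    """Return (object_id, similarity) pairs at or above threshold, best first.

    threshold is an assumed, illustrative value.
    """
    scored = [(oid, cosine(emb, query_embedding))
              for oid, emb in object_embeddings.items()]
    hits = [(oid, s) for oid, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for per-object CLIP image embeddings.
store = {
    "obj_1": [0.9, 0.1, 0.0],
    "obj_2": [0.0, 1.0, 0.0],
    "obj_3": [0.7, 0.7, 0.0],
}
# Toy vector standing in for the text embedding of the language query.
query = [1.0, 0.0, 0.0]

hits = matching_objects(store, query)
print(hits)
```

In a planner like the one described, objects that clear the threshold would be the candidates for object frontiers, while the rest remain in the map for later queries.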

Why It Matters

SAGE lets drones find specific objects in unknown buildings from natural-language commands, a capability useful for search and rescue and for inventory tasks.