DiffusionAnything: End-to-End In-context Diffusion Learning for Unified Navigation and Pre-Grasp Motion
A single AI model learns both meter-scale navigation and centimeter-scale grasping from just 5 minutes of self-supervised data.
A research team led by Iana Zhura has introduced DiffusionAnything, a breakthrough framework that unifies robot navigation and manipulation into a single, efficient AI model. Unlike current vision-language-action (VLA) models that demand massive computational resources and extensive training data, this diffusion-based policy learns from just 5 minutes of self-supervised data per task. It operates purely from RGB camera input, requiring only 2.0 GB of memory while running at 10 Hz, making it suitable for onboard deployment on real robots.
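To make the operating picture concrete, here is a minimal sketch of what an RGB-only diffusion-policy control loop running at roughly 10 Hz typically looks like. All names (`encode_rgb`, `denoiser`, `camera`, `robot`, `NUM_DENOISE_STEPS`) and the action-chunk dimensions are illustrative assumptions, not the paper's actual interface.

```python
# Sketch of a 10 Hz, RGB-only diffusion-policy control loop.
# Every identifier here is a placeholder for illustration only.
import time
import torch

NUM_DENOISE_STEPS = 16   # a small number of reverse-diffusion steps keeps inference fast
CONTROL_PERIOD_S = 0.1   # 10 Hz control rate

@torch.no_grad()
def act_once(encode_rgb, denoiser, camera, robot):
    obs = encode_rgb(camera.read())        # condition only on the RGB frame
    traj = torch.randn(1, 16, 7)           # noisy action/waypoint chunk (horizon 16, 7-D actions)
    for t in reversed(range(NUM_DENOISE_STEPS)):
        traj = denoiser(traj, obs, t)      # one reverse-diffusion (denoising) step
    robot.execute(traj[0, :4])             # execute the first few waypoints, then replan

def control_loop(encode_rgb, denoiser, camera, robot):
    while True:
        start = time.monotonic()
        act_once(encode_rgb, denoiser, camera, robot)
        # sleep whatever is left of the 100 ms budget
        time.sleep(max(0.0, CONTROL_PERIOD_S - (time.monotonic() - start)))
```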
Three key innovations enable its performance: multi-scale FiLM (Feature-wise Linear Modulation) conditioning lets the single model switch between meter-scale navigation and centimeter-scale manipulation; trajectory-aligned depth prediction focuses 3D reasoning along the generated waypoints; and self-supervised attention from AnyTraverse enables goal-directed inference without vision-language models or depth sensors. The result is a system that achieves robust zero-shot generalization to previously unseen environments while being far more data-efficient than existing approaches.
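FiLM applies a learned per-channel scale and shift to feature maps. The hypothetical PyTorch sketch below shows the general mechanism that "multi-scale FiLM conditioning" refers to: a single task embedding modulating an RGB encoder's feature pyramid at several resolutions. Module names, dimensions, and the pyramid layout are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical multi-scale FiLM conditioning: a task/scale embedding produces
# per-channel scale (gamma) and shift (beta) for features at each pyramid level.
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        # One linear head predicts gamma and beta for every channel.
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feats: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) feature map; cond: (B, cond_dim) task embedding.
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]   # broadcast over spatial dimensions
        beta = beta[:, :, None, None]
        return gamma * feats + beta       # per-channel affine modulation

class MultiScaleFiLM(nn.Module):
    """One FiLM layer per pyramid level, so a single backbone can be steered
    toward meter-scale navigation or centimeter-scale manipulation."""
    def __init__(self, cond_dim: int, channels_per_level: list[int]):
        super().__init__()
        self.films = nn.ModuleList(FiLMLayer(cond_dim, c) for c in channels_per_level)

    def forward(self, pyramid: list[torch.Tensor], cond: torch.Tensor):
        return [film(f, cond) for film, f in zip(self.films, pyramid)]

# Usage sketch: a task embedding (e.g. navigation vs. grasping) modulates a
# 3-level feature pyramid produced by an RGB encoder.
cond = torch.randn(1, 32)
pyramid = [torch.randn(1, c, s, s) for c, s in [(64, 80), (128, 40), (256, 20)]]
modulated = MultiScaleFiLM(32, [64, 128, 256])(pyramid, cond)
```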
- Learns from only 5 minutes of self-supervised data per task, compared with the massive datasets required by current VLA models
- Unifies navigation and manipulation in one model via multi-scale feature modulation, operating at 10 Hz within 2.0 GB of memory
- Achieves zero-shot generalization to novel scenes using only RGB input, eliminating the need for depth sensors or vision-language models
Why It Matters
Dramatically lowers the data and compute barrier for training versatile robots that can navigate and manipulate in real-world environments.