PhysBrain 1.0 uses human video to teach robots physical common sense
Scaling physical commonsense from human interaction video to robot action.
A multi-institution research team led by Shijie Lian has released PhysBrain 1.0, a technical report describing a new approach to giving robots physical commonsense. Rather than relying solely on robot trajectory data, which is expensive to collect and limited in coverage, the team draws on large-scale human egocentric video. Their data engine automatically extracts scene elements, spatial dynamics, action execution, and depth-aware relations from the footage, then converts these annotations into structured question-answer supervision. That supervision is used to train vision-language models (VLMs) so that they acquire physical priors.
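To make the idea of converting video annotations into question-answer supervision concrete, here is a minimal Python sketch. It is not the PhysBrain data engine: the `ClipAnnotation` fields, the question templates, and the `to_qa_pairs` helper are all hypothetical, chosen only to mirror the four categories named above (scene elements, spatial dynamics, action execution, depth-aware relations).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ClipAnnotation:
    """Hypothetical per-clip annotation; illustrative fields, not the report's schema."""
    clip_id: str
    objects: list[str]          # scene elements
    action: str                 # action execution
    motion: str                 # spatial dynamics
    depth_order: list[str] = field(default_factory=list)  # nearest to farthest


def to_qa_pairs(ann: ClipAnnotation) -> list[dict[str, str]]:
    """Turn one annotation into QA records, one per populated category."""
    pairs: list[dict[str, str]] = []

    def add(category: str, question: str, answer: str) -> None:
        pairs.append({
            "clip_id": ann.clip_id,
            "category": category,
            "question": question,
            "answer": answer,
        })

    if ann.objects:
        add("scene_elements",
            "Which objects are visible in the clip?",
            ", ".join(ann.objects))
    if ann.motion:
        add("spatial_dynamics",
            "How do the objects move relative to each other?",
            ann.motion)
    if ann.action:
        add("action_execution",
            "What action is the person performing?",
            ann.action)
    # A depth question needs at least two objects to compare.
    if len(ann.depth_order) >= 2:
        near, far = ann.depth_order[0], ann.depth_order[-1]
        add("depth_relations",
            f"Which is closer to the camera: the {near} or the {far}?",
            near)
    return pairs


if __name__ == "__main__":
    example = ClipAnnotation(
        clip_id="clip_0001",
        objects=["mug", "kettle"],
        action="pouring water into the mug",
        motion="the kettle tilts toward the mug",
        depth_order=["mug", "kettle"],
    )
    for record in to_qa_pairs(example):
        print(record["category"], "|", record["question"], "|", record["answer"])
```

In this sketch each category maps to a single fixed template. A real pipeline would presumably need many more question types and some quality filtering, and the summary here does not describe those details.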
The resulting physical knowledge is then transferred to vision-language-action (VLA) policies through a capability-preserving and language-sensitive adaptation design. The report claims state-of-the-art results for PhysBrain 1.0 across multiple benchmarks: ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa. Its out-of-domain performance on SimplerEnv is especially strong, which the authors take as evidence that scaling physical commonsense from human video can bridge multimodal understanding and robot action. The work points to a possible path for training generalist robots without millions of robot demonstrations.
- Uses large-scale human egocentric video instead of robot trajectory data to learn physical commonsense
- Data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations for VLM training
- Achieves SOTA on ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa with strong out-of-domain generalization
Why It Matters
If physical commonsense learned from human video transfers reliably to robot policies, general-purpose robots could be trained with far fewer costly robot demonstrations.