DexSim2Real: Foundation Model-Guided Sim-to-Real Transfer for Generalizable Dexterous Manipulation
Vision-language models guide robot hands from simulation to reality, reaching a 78.2% real-world success rate.
Sim-to-real transfer remains a major bottleneck for deploying learned robotic manipulation policies in the real world: existing approaches such as manual domain randomization or task-specific adaptation often fail to generalize across diverse scenarios. DexSim2Real tackles this by integrating vision-language foundation models into a unified framework built on three core components: Foundation Model-Guided Domain Randomization (FM-DR), which uses a vision-language model as a visual realism critic and optimizes simulation parameters in a closed loop with CMA-ES (Covariance Matrix Adaptation Evolution Strategy) driven by direct visual feedback; a Tactile-Visual Cross-Attention Policy (TVCAP), which fuses tactile and visual data for zero-shot reinforcement learning; and a Progressive Skill Curriculum (PSC), which leverages LLM-based task decomposition with a difficulty scheduler tailored to contact-rich manipulation.
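The closed-loop FM-DR idea can be illustrated with a toy sketch: sample candidate simulation parameters from a Gaussian, score each candidate with a realism critic, and adapt the search distribution from the best half. This is a simplified CMA-ES-style loop (rank-mu covariance update only, no step-size path control), not the paper's implementation, and `vlm_realism_score` is a hypothetical stand-in for rendering the simulator and querying a vision-language model.

```python
import numpy as np

def vlm_realism_score(params):
    # Hypothetical stand-in for the VLM critic: in the real system this
    # would render the simulator with `params` and ask a vision-language
    # model to score visual realism. Here: a toy quadratic whose optimum
    # plays the role of the "most realistic" setting.
    target = np.array([0.8, 0.4, 1.2])  # e.g. friction, mass scale, lighting
    return -np.sum((params - target) ** 2)

def optimize_sim_params(dim=3, iters=60, popsize=12, sigma=0.5, seed=0):
    """Simplified CMA-ES-style loop: sample candidates, rank them by the
    critic's score, and update mean and covariance from the top half."""
    rng = np.random.default_rng(seed)
    mean, cov = np.zeros(dim), np.eye(dim)
    mu = popsize // 2
    # Log-decreasing recombination weights over the elite, as in CMA-ES.
    weights = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    weights /= weights.sum()
    for _ in range(iters):
        cand = rng.multivariate_normal(mean, sigma**2 * cov, size=popsize)
        scores = np.array([vlm_realism_score(c) for c in cand])
        elite = cand[np.argsort(scores)[::-1][:mu]]  # most realistic first
        diff = (elite - mean) / sigma
        # Rank-mu covariance update, blended with the old covariance.
        cov = 0.7 * cov + 0.3 * (weights[:, None] * diff).T @ diff
        mean = weights @ elite
    return mean

best = optimize_sim_params()
```

In the full system the critic closes the loop between rendering and parameter search, so the simulator's appearance (not a hand-tuned randomization range) drives where the distribution of training environments concentrates.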
Extensive experiments across six challenging manipulation tasks with blinded evaluation show that DexSim2Real achieves a 78.2% average real-world success rate, significantly outperforming prior state-of-the-art systems such as DrEureka and DeXtreme, and reduces the sim-to-real performance gap to just 8.3%. By combining visual-language reasoning with tactile feedback and structured skill progression, DexSim2Real demonstrates that foundation models can effectively bridge the simulation-reality divide for dexterous manipulation, enabling more robust and generalizable robot control in unstructured environments.
- Uses a vision-language model as a visual realism critic to optimize simulation parameters via CMA-ES, replacing manual domain randomization.
- Fuses fingertip tactile data with camera features through a tactile-visual cross-attention policy, enabling zero-shot reinforcement learning.
- Achieves 78.2% average real-world success rate across six dexterous tasks, reducing the sim-to-real gap to only 8.3%.
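The tactile-visual fusion in the second bullet can be sketched as standard cross-attention, with queries from tactile tokens and keys/values from visual tokens. This is a minimal single-head sketch under assumed shapes (token counts, embedding size, and the projection matrices are all illustrative), not TVCAP's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(tactile, visual, Wq, Wk, Wv):
    """Tactile tokens attend over visual tokens: queries come from
    fingertip-sensor embeddings, keys/values from camera features."""
    q = tactile @ Wq                                   # (T, d)
    k, v = visual @ Wk, visual @ Wv                    # (V, d) each
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # (T, V) weights
    return attn @ v                                    # (T, d) fused features

rng = np.random.default_rng(0)
d = 8                                   # illustrative embedding size
tactile = rng.normal(size=(5, d))       # e.g. 5 fingertip tokens (toy data)
visual = rng.normal(size=(16, d))       # e.g. 16 camera patch tokens (toy data)
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
fused = cross_attention(tactile, visual, Wq, Wk, Wv)
```

Using tactile queries against visual keys lets each contact signal pull in the image regions most relevant to it, which is one plausible reading of how the policy grounds touch in vision.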
Why It Matters
Enables reliable robotic dexterity in real-world applications by dramatically cutting simulation-to-reality performance loss.