HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents
Tencent's new embodied AI models outperform similarly sized rivals on 16 of 22 benchmarks, and the larger variant matches Google's Gemini 3.0 Pro.
Tencent's research teams have introduced HY-Embodied-0.5, a significant step toward creating AI that can effectively perceive and act in the physical world. Unlike general-purpose vision-language models, this family is specifically engineered for the demands of embodied agents: robots or software agents that interact with a real environment. The release includes two core variants: a compact 2-billion-parameter model optimized for deployment on edge devices and a more powerful 32-billion-parameter model designed for sophisticated reasoning tasks. To achieve the fine-grained visual understanding required for tasks like navigation and manipulation, the team adopted a novel Mixture-of-Transformers (MoT) architecture. This design uses latent tokens to enhance perceptual representations, allowing for more effective processing of spatial and temporal visual data.
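The summary does not spell out the block design, but the general idea of latent perceptual tokens can be illustrated with a minimal, hypothetical sketch: a small set of learnable latent queries cross-attends to patch features from a vision encoder and emits a compact representation for the language backbone to consume. The module name, dimensions, and the use of PyTorch below are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn

class LatentPerceiver(nn.Module):
    """Illustrative sketch (not the HY-Embodied-0.5 code): learnable latent
    tokens cross-attend to vision-encoder patch features, compressing them
    into a fixed-size perceptual representation."""
    def __init__(self, dim=1024, num_latents=64, num_heads=16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, vision_feats):                      # (B, N_patches, dim)
        b = vision_feats.size(0)
        q = self.latents.unsqueeze(0).expand(b, -1, -1)   # (B, num_latents, dim)
        x, _ = self.cross_attn(q, vision_feats, vision_feats)
        return x + self.ffn(x)                            # (B, num_latents, dim)

# Example: 256 patch features in, 64 latent tokens out.
feats = torch.randn(2, 256, 1024)
latent_tokens = LatentPerceiver()(feats)                  # shape (2, 64, 1024)
```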
Beyond perception, the models are built for advanced reasoning, prediction, and planning. The researchers employed an iterative, self-evolving post-training paradigm to boost these capabilities and used on-policy distillation to transfer the large model's skills to the smaller one. In extensive evaluations across 22 benchmarks for visual perception, spatial reasoning, and embodied understanding, the MoT-2B model outperformed state-of-the-art models of similar size on 16 benchmarks. The flagship 32B variant achieved performance comparable to frontier models like Google's Gemini 3.0 Pro. Crucially, the team demonstrated practical utility by using the robust vision-language foundation to train a Vision-Language-Action (VLA) model, which showed compelling results in real-world robot control experiments. The entire project, including code and model weights, has been open-sourced, accelerating research in embodied AI.
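The report describes on-policy distillation only at a high level. One common formulation, sketched below purely under assumptions, has the small student generate rollouts, the large teacher re-score the same tokens, and the objective minimize the per-token KL divergence between the two next-token distributions. The function and tensor names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits, mask):
    """Hedged sketch of an on-policy distillation loss: the student generated
    the tokens, the teacher re-scored them, and we minimize reverse
    KL(student || teacher) over the generated positions.
    student_logits, teacher_logits: (B, T, vocab); mask: (B, T) float,
    1.0 on tokens the student generated."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)       # per-token KL
    return (kl * mask).sum() / mask.sum().clamp(min=1)
```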
- Two specialized models: a 2B-parameter variant for edge deployment and a 32B-parameter variant for complex reasoning.
- The 2B model outperformed similarly sized state-of-the-art models on 16 of 22 benchmarks, while the 32B model matched Gemini 3.0 Pro.
- Used to train a Vision-Language-Action model that achieved strong results in real-world physical robot control tests.
Why It Matters
This provides a powerful, open-source foundation for developing robots and agents that can reliably perceive, reason, and act in complex physical environments.