Image & Video

Tencent HY-World 2.0 appears to be dropping on April 15 — open-source multimodal 3D world generation from Tencent Hunyuan

Multimodal model creates persistent, navigable 3D scenes from text, images, or video for game engines.

Deep Dive

Tencent's Hunyuan AI team is reportedly set to release HY-World 2.0 on April 15, marking a significant leap in open-source 3D world generation. Unlike previous models that primarily output video flythroughs, HY-World 2.0 is billed as a multimodal, 'engine-ready World Model' that creates persistent, navigable 3D environments. It accepts inputs ranging from a simple text prompt or a single image to multiple images and video clips, then constructs a full 3D scene with collision physics that can be explored freely.
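
Because nothing has shipped yet, any interface is guesswork; the sketch below only illustrates how the reported input modes (text, single image, multiple images, video) might be folded into one conditioning payload. The function, parameters, and field names are all hypothetical, not Tencent's API.

```python
from pathlib import Path

# Hypothetical helper (not Tencent's API): normalizes the reported input
# modes -- text, one or more images, or a video clip -- into one payload.
def build_condition(prompt: str | None = None,
                    images: list[Path] | None = None,
                    video: Path | None = None) -> dict:
    if not any([prompt, images, video]):
        raise ValueError("need at least one of: prompt, images, video")
    return {
        "text": prompt,
        "images": [str(p) for p in (images or [])],
        # A real pipeline would presumably sample frames from the clip.
        "video": str(video) if video else None,
    }

# Text-only and text-plus-reference-image conditioning would both be valid:
print(build_condition(prompt="a medieval harbor town at dusk"))
print(build_condition(prompt="same town at night", images=[Path("ref.jpg")]))
```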

The key breakthrough is its production-oriented export pipeline. Instead of rendering a fixed video, the model generates scenes as editable 3D assets, including 3D Gaussian Splatting (3DGS) scenes, meshes, and point clouds, that are directly compatible with major game engines such as Unity and Unreal Engine. This enables a 'text/image-to-game' workflow, where a single prompt can yield a playable environment. The technical stack, described on an architecture page, is a multi-stage process: HY-Pano 2.0 for panorama initialization, WorldNav for trajectory planning, HY-WorldStereo for world expansion, and HY-WorldMirror 2.0 for unified 3D composition.
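
Since neither code nor weights are public, the sketch below is a guess at the control flow only: it wires the four reported stages together in order, with placeholder stages standing in for the actual models. The stage names come from the architecture page; everything else (the Stage class, generate_world, the payload dict) is an illustrative assumption.

```python
from dataclasses import dataclass

# Hypothetical orchestration of the four stages Tencent has described.
# None of this is Tencent's API; each Stage is a placeholder so the flow
# (panorama -> trajectory -> expansion -> composition) actually executes.

@dataclass
class Stage:
    name: str

    def run(self, payload: dict) -> dict:
        # Placeholder: a real stage would invoke the corresponding model.
        print(f"[{self.name}] consuming {sorted(payload)}")
        payload[self.name] = f"<{self.name} output>"
        return payload

PIPELINE = [
    Stage("HY-Pano 2.0"),        # panorama initialization from text/images
    Stage("WorldNav"),           # camera-trajectory planning over the panorama
    Stage("HY-WorldStereo"),     # view expansion along the planned trajectory
    Stage("HY-WorldMirror 2.0"), # fusion into one unified, editable 3D scene
]

def generate_world(prompt: str) -> dict:
    state = {"prompt": prompt}
    for stage in PIPELINE:
        state = stage.run(state)
    return state  # would carry the 3DGS/mesh/point-cloud exports in practice

if __name__ == "__main__":
    generate_world("a rainy cyberpunk alley with neon signage")
```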

If Tencent releases the full inference code and model weights as open source, HY-World 2.0 could become one of the most powerful tools available for 3D content creation. It effectively combines multimodal prompting, persistent 3D geometry generation, and real-world reconstruction (turning photos/video into 'digital twins') into a single system. This has immediate implications for rapid game prototyping, architectural visualization, virtual film production, and creating training environments for robotics and embodied AI. The launch will clarify critical details like licensing, model size, and hardware requirements, which will determine its accessibility and real-world adoption.

Key Points
  • Generates persistent, navigable 3D worlds from text, images, or video, not just pre-rendered clips
  • Exports engine-ready, editable 3D assets (3DGS, mesh, point cloud) for Unity/Unreal, enabling downstream production (a loading sketch follows this list)
  • Combines generation and real-world reconstruction, allowing creation of 'digital twins' from photos or video clips
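
On the second point: if the exports follow common conventions (3DGS scenes in the widely used Gaussian-splat .ply layout, meshes as glTF/GLB, which Unity and Unreal both ingest natively or via standard importers), off-the-shelf Python tooling can sanity-check them before engine import. A minimal sketch using the real plyfile and trimesh libraries; the file names, and the assumption that HY-World 2.0 uses these formats, are mine.

```python
from plyfile import PlyData   # pip install plyfile
import trimesh                # pip install trimesh

# Assumption: the 3DGS export uses the common Gaussian-splatting .ply
# layout (x/y/z positions plus per-splat opacity, scale, rotation fields).
splats = PlyData.read("hyworld_scene_3dgs.ply")["vertex"]
print(f"{splats.count} Gaussians, attributes: {splats.data.dtype.names}")

# Assumption: the mesh export is glTF/GLB (placeholder file name).
mesh = trimesh.load("hyworld_scene.glb", force="mesh")
print(f"mesh: {len(mesh.vertices)} vertices, {len(mesh.faces)} faces,"
      f" watertight={mesh.is_watertight}")
```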

Why It Matters

Shifts AI 3D generation from creating videos to producing editable assets for games, simulation, and virtual production.