SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation
Researchers' new framework eliminates step-by-step review loops, cutting runtime while improving scene quality.
A research team from the University of Maryland and the University of California, Irvine, has introduced SceneOrchestra, a novel framework that rethinks how AI agents create 3D scenes. Current agentic systems for 3D synthesis rely on a slow, iterative process in which a large language model (LLM) orchestrator executes one tool at a time, renders the intermediate result for review, and then decides the next step. This execute-review-reflect loop introduces significant latency and often leads to suboptimal tool selection driven by heuristic decision-making.
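To make the bottleneck concrete, here is a minimal sketch of that execute-review-reflect loop. Every name below (plan_next_step, execute_tool, render) is an illustrative stand-in, not the paper's actual API.

```python
# Minimal sketch of the execute-review-reflect loop; all names are
# illustrative stand-ins, not SceneOrchestra's real interface.

def plan_next_step(instruction, rendered_view):
    """Stand-in for one LLM round-trip that picks the next tool call."""
    return None  # a real orchestrator would return a (tool, params) pair

def execute_tool(tool, params, scene):
    """Stand-in for running a single scene-editing tool."""
    return scene

def render(scene):
    """Stand-in for the costly intermediate render used only for review."""
    return b""

def iterative_synthesis(instruction, max_steps=20):
    scene, view = {}, None
    for _ in range(max_steps):
        step = plan_next_step(instruction, view)  # one LLM call per step
        if step is None:                          # orchestrator declares done
            break
        tool, params = step
        scene = execute_tool(tool, params, scene)
        view = render(scene)                      # per-step review latency
    return scene
```

Each pass through the loop pays for both an LLM round-trip and a render, which is exactly the per-step cost SceneOrchestra removes.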
SceneOrchestra addresses these limitations by training an orchestrator to generate complete tool-call trajectories from instructions in a single pass. The framework employs a two-phase training strategy: first, the orchestrator learns context-aware tool selection and full trajectory generation while a discriminator assesses trajectory quality; second, interleaved training allows the discriminator to adapt to the orchestrator's evolving outputs and distill its evaluation capability back into the orchestrator. At inference time, only the orchestrator is needed to generate and execute entire workflows without step-by-step reviews.
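A hedged sketch of what that two-phase schedule could look like, assuming a conventional generator/discriminator training setup; the Model class, loss functions, and update calls are assumptions made for this example, not the paper's implementation.

```python
# Illustrative two-phase training loop; every class, loss, and
# hyperparameter here is an assumption, not the published method.

class Model:
    """Stand-in for a trainable model (orchestrator or discriminator)."""
    def generate(self, instruction):
        # Emit a full tool-call trajectory in a single pass (stubbed).
        return [("place_object", {"name": "chair"})]
    def score(self, trajectory):
        return 0.5  # discriminator's trajectory-quality estimate (stubbed)
    def update(self, loss):
        pass        # one gradient step (stubbed)

def trajectory_loss(pred, gold):
    return 0.0                   # supervised sequence loss (stubbed)

def quality_loss(score, label=1.0):
    return (score - label) ** 2  # fit discriminator scores to quality labels

def distill_loss(score):
    return 1.0 - score           # push the orchestrator toward high scores

def train(orchestrator, discriminator, dataset, interleaved_rounds=3):
    # Phase 1: the orchestrator learns context-aware tool selection and
    # full trajectory generation; the discriminator learns to assess quality.
    for instruction, gold in dataset:
        traj = orchestrator.generate(instruction)
        orchestrator.update(trajectory_loss(traj, gold))
        discriminator.update(quality_loss(discriminator.score(traj)))

    # Phase 2: interleaved updates. The discriminator adapts to the
    # orchestrator's evolving outputs, and its evaluation signal is
    # distilled back into the orchestrator.
    for _ in range(interleaved_rounds):
        for instruction, _ in dataset:
            traj = orchestrator.generate(instruction)
            s = discriminator.score(traj)
            discriminator.update(quality_loss(s))
            orchestrator.update(distill_loss(s))
```

The payoff of the distillation step is that the discriminator can be dropped entirely at inference time, leaving a single model to plan.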
The system's key innovation lies in its ability to plan complete sequences of tool calls, including parameters and execution order, before any scene generation begins. This eliminates the latency of repeatedly rendering intermediate 3D scenes for review, a major bottleneck in previous approaches. Extensive experiments demonstrate that SceneOrchestra achieves superior scene quality while substantially reducing runtime compared to existing agentic frameworks, marking a significant advance in efficient 3D content creation.
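For contrast with the loop above, here is a minimal sketch of single-pass inference under the same assumptions: the orchestrator emits the whole trajectory once, and execution proceeds with no per-step rendering or review. The tool registry and generate_trajectory stub are hypothetical.

```python
# Hypothetical single-pass inference: plan everything, then execute.

TOOLS = {
    "place_object": lambda scene, **p: {**scene, p["name"]: p},
    "set_lighting": lambda scene, **p: {**scene, "lighting": p},
}

def generate_trajectory(instruction):
    """Stand-in for one forward pass of the trained orchestrator."""
    return [
        ("place_object", {"name": "sofa", "position": (0, 0, 0)}),
        ("place_object", {"name": "lamp", "position": (1, 0, 1)}),
        ("set_lighting", {"intensity": 0.8}),
    ]

def synthesize(instruction):
    scene = {}
    # The full plan (tools, parameters, order) exists before any tool runs;
    # there are no intermediate renders or reflection steps in between.
    for tool, params in generate_trajectory(instruction):
        scene = TOOLS[tool](scene, **params)
    return scene

print(synthesize("a cozy living room with warm lighting"))
```

Runtime now scales with tool execution alone, since a single orchestrator call replaces one LLM round-trip and one render per step.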
- Generates complete tool-call trajectories upfront instead of step-by-step execution
- Eliminates the intermediate-rendering latency that slowed previous systems by 40-60%
- Uses two-phase training with an orchestrator-discriminator architecture to learn trajectory-level planning
Why It Matters
Enables faster, higher-quality 3D content creation for gaming, VR, and film production pipelines.