Research & Papers

DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code Generation

New AI system achieves over 10% better visual similarity and 3x productivity gains in UI generation.

Deep Dive

A research team led by Xinhao Huang has introduced DOne, an end-to-end AI framework that fundamentally rethinks how Vision Language Models (VLMs) convert visual designs into functional code. The system addresses the "holistic bottleneck" where current models struggle to reconcile high-level structural hierarchy with fine-grained visual details, often resulting in layout distortions or generic placeholders. DOne's breakthrough approach decouples structure understanding from element rendering through three key innovations: a learned layout segmentation module that decomposes complex designs without heuristic cropping limitations, a specialized hybrid element retriever optimized for UI components' extreme aspect ratios and densities, and a schema-guided generation paradigm that effectively bridges layout and code.
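To make the three-stage division of labor concrete, here is a minimal Python sketch of the decoupled pipeline. The names and interfaces (LayoutNode, segment_layout, retrieve_components, generate_code) and the schema fields are illustrative assumptions for exposition, not DOne's published API.

```python
# Illustrative sketch only: names and interfaces are assumptions,
# not DOne's actual implementation.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LayoutNode:
    """One region in the layout schema produced by segmentation."""
    bbox: tuple[int, int, int, int]                  # (x, y, width, height)
    children: list["LayoutNode"] = field(default_factory=list)
    component_id: Optional[str] = None               # filled in by the retriever

def segment_layout(design_image) -> LayoutNode:
    """Stage 1 (assumed interface): a learned segmenter decomposes the
    design into a region tree, replacing heuristic cropping."""
    raise NotImplementedError("stand-in for the learned segmentation module")

def retrieve_components(node: LayoutNode) -> None:
    """Stage 2 (assumed interface): a hybrid retriever matches each leaf
    region to a library component, tolerating the extreme aspect ratios
    and densities typical of UI elements."""
    if not node.children:
        node.component_id = "component/placeholder"  # stand-in match
    for child in node.children:
        retrieve_components(child)

def generate_code(schema: LayoutNode) -> str:
    """Stage 3 (assumed interface): generation conditions on the explicit
    schema rather than raw pixels, so hierarchy survives into the code."""
    raise NotImplementedError("stand-in for schema-guided generation")
```

The point of the sketch is the data flow: each stage hands off an explicit, typed structure instead of asking one model to hold the whole page in its head at once.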

To rigorously test their system, the researchers created HiFi2Code, a benchmark featuring significantly higher layout complexity than existing datasets. Extensive evaluations show DOne outperforming existing methods, with over 10% improvement in GPT Score for visual similarity and better fine-grained element alignment. Human evaluations confirm the practical impact, showing a 3x productivity gain with higher visual fidelity than current approaches. The framework represents a significant advance in design-to-code generation, moving beyond the limitations of holistic VLM approaches toward a modular, specialized architecture that better handles the complexities of real-world UI designs.

The technical paper, submitted to arXiv under identifier 2604.01226, details how DOne's architecture enables more accurate code generation from complex visual inputs. By separating the tasks of understanding structural relationships between elements and rendering individual components, the system can maintain both high-level layout integrity and precise visual details. This approach contrasts with current end-to-end VLM methods that often sacrifice one aspect for the other, resulting in either structurally sound but visually generic outputs or detailed but poorly structured code.
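A toy example shows why generating from a schema rather than pixels preserves structure: once the layout tree is explicit, emitting faithful markup is mechanical. The dictionary-based schema format below is hypothetical, chosen only to illustrate the idea, and is not the paper's representation.

```python
# Toy illustration (hypothetical schema format): code is emitted from an
# explicit layout tree, so the hierarchy cannot flatten or drift.

def render(node: dict, indent: int = 0) -> str:
    """Recursively render a layout-tree node as indented HTML."""
    pad = "  " * indent
    tag = node.get("tag", "div")
    if "text" in node:                       # leaf element with content
        return f'{pad}<{tag}>{node["text"]}</{tag}>'
    inner = "\n".join(render(c, indent + 1) for c in node.get("children", []))
    return f"{pad}<{tag}>\n{inner}\n{pad}</{tag}>"

schema = {
    "tag": "header",
    "children": [
        {"tag": "h1", "text": "Pricing"},
        {"tag": "nav", "children": [
            {"tag": "a", "text": "Docs"},
            {"tag": "a", "text": "Sign in"},
        ]},
    ],
}

print(render(schema))
```

Because the generator receives the hierarchy explicitly, structural integrity is enforced by construction; the model's capacity is spent on rendering details rather than on rediscovering layout.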

Key Points
  • DOne decouples structure and rendering with learned layout segmentation, avoiding heuristic cropping limitations
  • Achieves over 10% better visual similarity (GPT Score) and 3x productivity gains in human evaluations
  • Introduces HiFi2Code benchmark with significantly higher layout complexity than existing datasets

Why It Matters

Dramatically improves AI's ability to convert complex UI designs to functional code, potentially revolutionizing front-end development workflows.