1D-Bench: A Benchmark for Iterative UI Code Generation with Visual Feedback in Real-World E-Commerce Workflows
New benchmark forces AI models to iteratively repair UI code generated from flawed design exports, with each task scoped to a single workday.
A research team led by Qiao Xu, Yipeng Yu, Chengxiao Feng, and Xu Liu has introduced 1D-Bench, a comprehensive new benchmark designed to rigorously evaluate AI models on the practical task of iterative UI code generation with visual feedback. The benchmark is grounded in real-world e-commerce workflows and contains over 1,000 instances, each providing both a reference visual rendering and an exported intermediate representation (IR) that intentionally contains extraction errors, simulating the imperfect outputs of real design tools like Figma.
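To make the instance structure concrete, the sketch below models one benchmark record as a Python dataclass. The field names are hypothetical, inferred from the description above rather than taken from the released dataset.

```python
from dataclasses import dataclass, field

@dataclass
class BenchInstance:
    """One 1D-Bench task record (hypothetical field names, for illustration only)."""
    instance_id: str
    reference_image_path: str   # ground-truth visual rendering of the target UI
    exported_ir: dict           # design-tool export with intentional extraction errors
    error_types: list[str] = field(default_factory=list)  # e.g., dropped nodes, wrong styles
```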
The core innovation of 1D-Bench is its focus on robustness and iteration. Instead of evaluating literal adherence to a perfect specification, it tests how well models like GPT-4V, Claude 3, and open-weight multimodal models can use flawed structural cues to generate a correct, executable React codebase. The benchmark enforces a fixed development toolchain and requires models to produce code with an explicit component hierarchy. Crucially, it defines a multi-round evaluation setting where models receive execution feedback (a rendered visual) and must iteratively apply targeted, component-level edits to fix discrepancies, mirroring a real developer's workflow. The '1D' stands for 'one day,' setting the expectation that these design-to-code tasks should be completable within a single workday.
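In outline, that multi-round setting can be pictured as the loop below. This is a minimal sketch, not the released harness: the `model`, `render`, `visual_diff`, and `apply_edits` callables are hypothetical stand-ins for the model under test, the fixed toolchain's renderer, the visual-similarity metric, and component-level patching.

```python
def run_episode(instance, model, render, visual_diff, apply_edits,
                max_rounds: int = 5, threshold: float = 0.95):
    """Hypothetical multi-round loop: generate React code, render, compare, edit."""
    # Round 0: generate a full React codebase from the flawed intermediate representation.
    codebase = model.generate_code(instance.exported_ir)
    score = 0.0
    for _ in range(max_rounds):
        rendering = render(codebase)                                   # execute with the fixed toolchain
        score = visual_diff(rendering, instance.reference_image_path)  # compare to the reference visual
        if score >= threshold:                                         # close enough: stop iterating
            break
        # Otherwise the model proposes targeted, component-level edits from the visual feedback.
        edits = model.propose_edits(codebase, rendering, instance.reference_image_path)
        codebase = apply_edits(codebase, edits)
    return codebase, score
```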
Initial experiments reveal that iterative editing consistently improves final performance, increasing rendering success rates by 15-20% and often enhancing visual similarity. The researchers also conducted a pilot study on advanced training techniques, including post-training with synthetic repair trajectories and reinforcement learning for editing actions. They found these methods yielded limited and unstable gains, potentially due to the sparse nature of terminal rewards (only the final visual matters) and the high variance introduced by file-level code updates. These findings underscore the task's difficulty, and the benchmark provides a clear, standardized framework for future model development in this high-value application area.
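To see why that reward is sparse, note that under this protocol an editing policy gets no feedback until the final render. One plausible way to write this down (our notation, not the paper's) is a reward that is zero at every intermediate round:

```python
def terminal_reward(round_idx: int, final_round: int, rendering, reference, sim) -> float:
    # Sparse terminal reward: every intermediate edit scores zero; only the
    # final rendered visual is evaluated, so credit assignment across rounds is hard.
    return sim(rendering, reference) if round_idx == final_round else 0.0
```

Every intermediate edit must be credited through that single terminal signal, which is one standard explanation for unstable reinforcement-learning gains on long-horizon editing tasks.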
- Benchmark contains 1,000+ real e-commerce UI instances with intentionally flawed intermediate design representations to test model robustness.
- Evaluates models on generating executable React code and performing iterative, component-level edits using visual execution feedback over multiple rounds.
- Initial tests show iterative editing improves rendering success by 15-20% for leading models, while RL-based post-training yields only limited, unstable gains.
Why It Matters
Provides the first standardized test for AI-powered design-to-code tools, a critical step toward automating front-end development.