New benchmark WebIGBench tests AI's ability to code interactive webpages
MLLMs struggle with dynamic page interactions, new benchmark reveals.
Current multimodal large language models (MLLMs) can turn visual designs into code, but existing benchmarks only test static pages. A new paper from Fan Wu and colleagues at Harbin Institute of Technology and other institutions introduces WebIGBench, the first benchmark specifically designed to evaluate code generation for interactive webpages with complex interactions. The dataset contains 103 pages collected from real websites, covering 5 common interaction types (e.g., click, input) with 871 individual actions. The authors also propose an automated evaluation pipeline that checks whether the generated page's interactive behavior matches the reference, rather than scoring only visual fidelity and code structure.
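To make the idea of checking interactive behavior concrete, here is a minimal Python sketch. It is not the authors' pipeline: the `Action` class, the toy pages, and the string-valued "observable state" are all hypothetical stand-ins, and a real evaluator would presumably drive a browser rather than call plain functions. The sketch only shows the shape of the comparison: replay the same actions on the reference and the generated page, then compare what each one produces.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    """Hypothetical description of one user action on a page."""
    kind: str        # e.g. "click" or "input"
    target: str      # made-up element identifier
    value: str = ""  # text typed for "input" actions


# Assumption: a "page" is modeled as a function that maps an action
# to the observable state it produces, encoded as a string.
Page = Callable[[Action], str]


def interaction_consistent(reference: Page, generated: Page,
                           actions: List[Action]) -> List[bool]:
    """Replay each action on both pages and compare the resulting states."""
    return [reference(a) == generated(a) for a in actions]


def reference_page(a: Action) -> str:
    # Toy reference: the menu opens on click, the search box echoes input.
    if a.kind == "click" and a.target == "menu":
        return "menu:open"
    if a.kind == "input" and a.target == "search":
        return f"search:{a.value}"
    return "no-op"


def generated_page(a: Action) -> str:
    # Toy generated page: the click works, the input handler is missing.
    if a.kind == "click" and a.target == "menu":
        return "menu:open"
    return "no-op"


actions = [Action("click", "menu"), Action("input", "search", "shoes")]
results = interaction_consistent(reference_page, generated_page, actions)
print(results)
```

In this toy case the generated page could look identical to the reference in a screenshot and still fail the second check, which is the kind of failure a static benchmark cannot detect.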
Experiments on several representative MLLMs (including GPT-4V, Claude 3.5, and Gemini Pro Vision) using WebIGBench reveal that current models perform poorly on interactive tasks. For example, success rates for generating correct click actions hover around 40%, and success rates for input actions are lower still. The benchmark exposes a clear gap: while MLLMs are strong at static layout generation, they lack understanding of dynamic user-page interactions. The results give model developers a reason to prioritize interactive code generation and a rigorous tool for tracking progress. The benchmark and evaluation code are publicly available on GitHub.
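A per-type success rate like the ones quoted above is simply the number of passing actions divided by the number of attempted actions, grouped by interaction type. The short Python sketch below shows that arithmetic. The function name and the sample outcomes are invented for illustration and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_type(outcomes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (action_type, passed) pairs and return the pass fraction per type."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for kind, ok in outcomes:
        totals[kind] += 1
        passed[kind] += int(ok)
    return {kind: passed[kind] / totals[kind] for kind in totals}


# Made-up outcomes for illustration only: 2 of 5 clicks pass, 1 of 3 inputs pass.
sample = [
    ("click", True), ("click", False), ("click", True),
    ("click", False), ("click", False),
    ("input", False), ("input", True), ("input", False),
]
rates = success_rate_by_type(sample)
print(rates)
```

Reporting the rate per interaction type, rather than one overall number, is what makes it possible to see that one type lags behind another.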
- WebIGBench includes 103 complex webpages with 871 interactive actions across 5 types (click, input, etc.)
- Novel pipeline evaluates interaction consistency, not just visual fidelity or code structure
- Experiments show current MLLMs achieve only ~40% success on click actions and do worse on input actions, revealing a major gap in interactive code generation
Why It Matters
For front-end devs and AI teams, this benchmark highlights critical weaknesses in automating dynamic UI code generation.