Developer Tools

New benchmark WebIGBench tests AI's ability to code interactive webpages

arXiv cs.SE June 02, 2026

⚡MLLMs struggle with dynamic page interactions, new benchmark reveals.

Deep Dive

Current multimodal large language models (MLLMs) can turn visual designs into code, but existing benchmarks only test static pages. A new paper from Fan Wu and colleagues at Harbin Institute of Technology and other institutions introduces WebIGBench, the first benchmark specifically designed to evaluate code generation for webpages with complex interactions. The dataset contains 103 pages collected from real websites, covering 5 common interaction types (e.g., click, input) with 871 individual actions. The authors also propose an automated evaluation pipeline that checks whether the generated page's interactive behavior matches the reference, going beyond visual fidelity and code structure.
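To make the idea of interaction consistency concrete, here is a minimal sketch of how a per-type success rate could be computed once each action has been replayed on both the reference page and the generated page. It is not the paper's pipeline: the `ActionResult` record, the string-based page states, and the `success_rate_by_type` helper are all assumptions made for illustration, and the sample data is invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionResult:
    """One interactive action replayed on both pages (hypothetical record)."""
    action_type: str      # e.g. "click" or "input"
    reference_state: str  # state observed on the reference page after the action
    generated_state: str  # state observed on the generated page after the action


def success_rate_by_type(results):
    """Return {action_type: fraction of actions whose resulting states match}."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for r in results:
        totals[r.action_type] += 1
        # An action counts as a success only if both pages end in the same state.
        if r.reference_state == r.generated_state:
            matches[r.action_type] += 1
    return {t: matches[t] / totals[t] for t in totals}


# Invented sample data for illustration; these are not results from the paper.
sample = [
    ActionResult("click", "menu-open", "menu-open"),
    ActionResult("click", "modal-shown", "no-change"),
    ActionResult("input", "field=abc", "field="),
]
rates = success_rate_by_type(sample)
```

A real pipeline would need a richer notion of page state than a single string (for example DOM or screenshot comparisons), so the equality check here stands in for whatever matching criterion the authors actually use.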

Experiments on several representative MLLMs (including GPT-4V, Claude 3.5, and Gemini Pro Vision) using WebIGBench reveal that current models perform poorly on interactive tasks. For example, success rates for generating correct click actions hover around 40%, and rates for input actions are lower still. The benchmark exposes a clear gap: while MLLMs are strong at static layout generation, they lack understanding of dynamic user-page interactions. The work gives model developers a reason to prioritize interactive code generation and a rigorous tool for tracking progress. The benchmark and evaluation code are publicly available on GitHub.

Key Points

WebIGBench includes 103 complex webpages with 871 interactive actions across 5 types (click, input, etc.)
Novel pipeline evaluates interaction consistency, not just visual fidelity or code structure
Experiments show current MLLMs achieve only ~40% success on simple click actions, revealing major performance gaps

Why It Matters

For front-end devs and AI teams, this benchmark highlights critical weaknesses in automating dynamic UI code generation.
