GUI agents improve game generation with 66.8% rubric pass rate

arXiv cs.SE May 28, 2026

⚡ New AI playtesters catch interaction bugs that even frontier models miss.

Deep Dive

Generating a game isn't the same as making one that's playable. A new arXiv paper (2605.28258) tackles this by introducing GUI agents as automated playtesters. The team built PlaytestArena, an evaluation environment with 200 browser-based game generation tasks across eight genres, each paired with rubrics for expected in-play behaviors. A GUI agent loads each build in a browser and plays it, checking for interaction-level failures that one-shot code generation typically misses.
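To make the "rubric pass rate" metric concrete, here is a minimal Python sketch of how a task and its rubric might be represented and scored. The `RubricItem` and `GameTask` classes, the example tasks, and the pooled (micro-averaged) scoring rule are all assumptions made for illustration; the summary above does not describe the paper's actual schema or its exact scoring formula.

```python
from dataclasses import dataclass, field


@dataclass
class RubricItem:
    """One expected in-play behavior (hypothetical schema)."""
    description: str
    passed: bool = False


@dataclass
class GameTask:
    """A game generation task paired with its rubric (hypothetical schema)."""
    prompt: str
    genre: str
    rubric: list[RubricItem] = field(default_factory=list)


def rubric_pass_rate(tasks: list[GameTask]) -> float:
    """Return the percentage of rubric items passed, pooled over all tasks.

    Assumption: items are pooled across tasks. The paper may instead
    average per-task rates, which can give a different number.
    """
    items = [item for task in tasks for item in task.rubric]
    if not items:
        return 0.0
    return 100.0 * sum(item.passed for item in items) / len(items)


# Made-up example data: the pass/fail flags would come from a playtest.
tasks = [
    GameTask("Build a Breakout clone", "arcade", [
        RubricItem("Paddle follows arrow keys", passed=True),
        RubricItem("Ball bounces off walls", passed=True),
        RubricItem("Bricks disappear when hit", passed=False),
    ]),
    GameTask("Build a sliding puzzle", "puzzle", [
        RubricItem("Tiles move into the empty slot", passed=True),
    ]),
]

print(f"{rubric_pass_rate(tasks):.1f}%")  # prints "75.0%"
```

In this toy data, three of the four rubric items pass, so the pooled rate is 75.0%.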

Their main contribution, Play2Code, turns game generation into a dialogue between coding and playing. A game agent and a GUI agent operate in a sustained loop with shared memory: the game agent writes code, and the GUI agent plays the resulting game and reports what it finds. In the paper's experiments, even frontier models struggle to generate playable games directly, while Play2Code achieves a 66.8% rubric pass rate, improving over single-pass and agentic-coding baselines by 37.1 and 14.6 points, respectively. The GUI playtester's feedback proved more traceable than human reports, yet idiosyncratic in ways reminiscent of human testers. The authors present game playtesting as a critical testbed for interactive code generation.
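The code-then-play loop described above can be sketched roughly as follows. This is not the paper's implementation: `SharedMemory`, `play2code_loop`, the `max_rounds` cutoff, the stop-when-no-issues rule, and both stub agents are hypothetical stand-ins chosen so that the example runs without a model or a browser.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SharedMemory:
    """History that both agents can read (hypothetical structure)."""
    builds: list[str] = field(default_factory=list)
    reports: list[list[str]] = field(default_factory=list)


def play2code_loop(
    task: str,
    write_code: Callable[[str, SharedMemory], str],
    playtest: Callable[[str, SharedMemory], list[str]],
    max_rounds: int = 5,  # assumed budget, not taken from the paper
) -> tuple[str, SharedMemory]:
    """Alternate coding and playtesting until no issues are reported."""
    memory = SharedMemory()
    build = ""
    for _ in range(max_rounds):
        build = write_code(task, memory)   # game agent writes a build
        memory.builds.append(build)
        issues = playtest(build, memory)   # GUI agent plays and reports
        memory.reports.append(issues)
        if not issues:                     # assumed stopping rule
            break
    return build, memory


def stub_coder(task: str, memory: SharedMemory) -> str:
    """Stand-in game agent: 'fixes' every issue reported so far."""
    fixed = sorted({issue for report in memory.reports for issue in report})
    return f"{task} | fixed: {fixed}"


def stub_playtester(build: str, memory: SharedMemory) -> list[str]:
    """Stand-in GUI agent: complains until the build mentions 'jump'."""
    return [] if "jump" in build else ["jump key does nothing"]


final_build, memory = play2code_loop("platformer", stub_coder, stub_playtester)
print(len(memory.builds))  # prints 2
```

With these stubs, the first build draws one complaint and the second build resolves it, so the loop stops after two rounds. In a real system, `write_code` would call a code-generating model and `playtest` would drive the game in a browser.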

Key Points

Play2Code achieves 66.8% rubric pass rate, 37.1 points higher than single-pass baselines.
PlaytestArena features 200 browser-based game tasks across 8 genres with expected-behavior rubrics.
GUI agent feedback is more traceable than human reports but still exhibits human-like idiosyncrasies.
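
The reported gaps also imply approximate baseline pass rates. The short Python calculation below assumes that "points" means absolute percentage points on the same rubric metric; the baseline values it prints are derived from that assumption, not quoted from the paper.

```python
# Reported figures from the summary above.
play2code_rate = 66.8
gaps = {"single-pass": 37.1, "agentic-coding": 14.6}

# Assumption: each gap is an absolute percentage-point difference.
implied_baselines = {
    name: round(play2code_rate - gap, 1) for name, gap in gaps.items()
}

for name, rate in implied_baselines.items():
    print(f"{name}: {rate}%")
# prints "single-pass: 29.7%" then "agentic-coding: 52.2%"
```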

Why It Matters

Automated playtesting via GUI agents could dramatically reduce game development QA costs and catch interaction bugs early.
