GUI agents improve game generation with 66.8% rubric pass rate
New AI playtesters catch interaction bugs that even frontier models miss.
Generating a game isn't the same as making one that's playable. A new arXiv paper (2605.28258) tackles this by introducing GUI agents as automated playtesters. The team built PlaytestArena, an evaluation environment with 200 browser-based game generation tasks across eight genres, each paired with rubrics for expected in-play behaviors. A GUI agent loads each build in a browser and plays it, checking for interaction-level failures that one-shot code generation typically misses.
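To make the rubric idea concrete, here is a minimal sketch of how a pass rate could be computed from per-behavior checks. The `RubricItem` class, the `rubric_pass_rate` function, and the example behaviors are all hypothetical, and the paper may aggregate results differently (for instance per task rather than per rubric item).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricItem:
    """One expected in-play behavior and whether the playtester observed it (hypothetical)."""
    description: str
    passed: bool


def rubric_pass_rate(items: List[RubricItem]) -> float:
    """Return the percentage of rubric items that passed (assumed aggregation)."""
    if not items:
        return 0.0
    return 100.0 * sum(item.passed for item in items) / len(items)


# Illustrative checks for an imagined platformer build.
example_items = [
    RubricItem("Arrow keys move the player", True),
    RubricItem("Score increases on coin pickup", True),
    RubricItem("Game-over screen appears at zero lives", False),
    RubricItem("Restart button resets the level", True),
]

print(rubric_pass_rate(example_items))  # 75.0
```

The point of the sketch is that every item describes something only observable by playing the build, which is why a GUI agent, rather than a static code check, does the grading.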
Their main contribution, Play2Code, turns game generation into a dialogue between coding and playing. A game agent and a GUI agent operate in a sustained loop with shared memory: the game agent writes code, and the GUI agent plays the resulting game and provides feedback. Experiments show that even frontier models struggle to generate playable games directly, while Play2Code achieves a 66.8% rubric pass rate, improving over single-pass and agentic-coding baselines by 37.1 and 14.6 points respectively. Feedback from the GUI playtester proved more traceable than human reports, yet idiosyncratic in ways reminiscent of human testers. The authors position game playtesting as a critical testbed for interactive code generation.
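The write-play-revise loop described above can be sketched roughly as follows. This is not the paper's implementation: `SharedMemory`, `play2code_loop`, the round limit, and both stub agents are hypothetical stand-ins, with plain functions in place of the model-backed game agent and the browser-driving GUI agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SharedMemory:
    """Feedback accumulated across rounds, visible to both agents (hypothetical)."""
    history: List[str] = field(default_factory=list)


def play2code_loop(
    write_code: Callable[[str, SharedMemory], str],
    playtest: Callable[[str], List[str]],
    task: str,
    max_rounds: int = 3,
) -> Tuple[str, SharedMemory]:
    """Alternate between writing code and playtesting until no issues are reported."""
    memory = SharedMemory()
    code = ""
    for _ in range(max_rounds):
        code = write_code(task, memory)   # game agent writes or revises the build
        issues = playtest(code)           # GUI agent plays it and reports failures
        if not issues:
            break
        memory.history.extend(issues)     # feedback is kept for the next revision
    return code, memory


# Stub agents for illustration only.
def stub_writer(task: str, memory: SharedMemory) -> str:
    return f"{task} v{len(memory.history) + 1}"


def stub_playtester(code: str) -> List[str]:
    return ["jump key does nothing"] if code.endswith("v1") else []


final_code, final_memory = play2code_loop(stub_writer, stub_playtester, "platformer")
print(final_code)            # platformer v2
print(final_memory.history)  # ['jump key does nothing']
```

In this toy run the first build fails a playtest, the issue lands in shared memory, and the second build passes, which is the basic shape of the feedback cycle the article describes.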
- Play2Code achieves a 66.8% rubric pass rate, 37.1 points above single-pass baselines and 14.6 points above agentic-coding baselines.
- PlaytestArena features 200 browser-based game tasks across 8 genres with expected-behavior rubrics.
- GUI agent feedback is more traceable than human reports but still exhibits human-like idiosyncrasies.
Why It Matters
Automated playtesting via GUI agents could dramatically reduce game development QA costs and catch interaction bugs early.