Models & Releases

GPT-5.4 beats all other top models by far in the Game Agent Coding League

OpenAI's latest model leads a competitive benchmark in which AI models write the code for game-playing agents.

Deep Dive

OpenAI's GPT-5.4 has emerged as the clear leader in the March iteration of the Game Agent Coding League (GACL), a specialized benchmark that tests large language models on their ability to generate functional code for autonomous game-playing agents. The league pits models against each other across seven different games, but with a twist: the models don't play the games themselves. Instead, each model generates the code for two separate AI agents, which then compete in the games. Only the top-performing agent from each model contributes to the final leaderboard score.
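To make that scoring rule concrete, here is a minimal sketch of how a leaderboard could be aggregated under it: for each model, only its better agent counts in each game, and per-game results are summed into a total. The data structures, function names, and scores below are hypothetical illustrations, not the actual GACL harness.

```python
from collections import defaultdict

# Hypothetical per-agent results: (model, agent_id, game, score).
# Each model submits two generated agents; only the better one per
# model counts toward the leaderboard, per the rule described above.
agent_results = [
    ("gpt-5.4",       "agent-a", "battleship", 0.92),
    ("gpt-5.4",       "agent-b", "battleship", 0.81),
    ("gpt-5.3-codex", "agent-a", "battleship", 0.88),
    ("gpt-5.3-codex", "agent-b", "battleship", 0.74),
]

def leaderboard(results):
    """Keep the best agent per (model, game), then sum each model's games."""
    best = defaultdict(float)
    for model, _agent, game, score in results:
        key = (model, game)
        best[key] = max(best[key], score)

    totals = defaultdict(float)
    for (model, _game), score in best.items():
        totals[model] += score

    # Highest total first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

for model, total in leaderboard(agent_results):
    print(f"{model}: {total:.2f}")
```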

GPT-5.4's victory wasn't narrow; it significantly outperformed other major proprietary models. Notably, GPT-5.3-Codex placed second and was described as "way ahead" of Anthropic's Claude 3.5 Sonnet. The results also highlighted the performance of leading open-weight models, with Kimi2.5 (from Moonshot AI) achieving an impressive global rank of #6, followed closely by GLM-5 at #7. The benchmark revealed game-specific strengths, with GPT models particularly dominating in Battleship, while Tic-Tac-Toe proved less useful as a discriminator since nearly all models performed similarly.

The GACL provides a unique and practical evaluation of a model's coding and agent-design capabilities, moving beyond simple code completion to assess the creation of competitive, strategic software. All game logs, scoreboards, and the actual agent code generated by each model are publicly available, offering transparency and a resource for developers. The league organizer has indicated plans to replace Tic-Tac-Toe with a more challenging game next month to further refine the benchmark.

Key Points
  • GPT-5.4 leads the March GACL, outperforming GPT-5.3-Codex, Claude Sonnet, and Gemini 3 Flash.
  • The open-weight Kimi2.5 model ranked #6 globally, demonstrating strong performance against closed models.
  • The benchmark tests code generation for game-playing agents across seven games, with all results and code publicly available.

Why It Matters

This benchmark provides a real-world test of AI coding and strategic reasoning, capabilities that are crucial for developing autonomous software agents.