Qwen3.5-27B performs almost on par with its 397B sibling and GPT-5 mini in the Game Agent Coding League
A 27B parameter model performs within 0.04 points of its 397B sibling in a competitive agent coding benchmark.
The March results from the Game Agent Coding League (GACL) reveal a surprising performance from Alibaba's Qwen family. The Qwen3.5-27B model, with just 27 billion parameters, scored within a razor-thin 0.04-point margin of its colossal 397-billion-parameter sibling, Qwen3.5-397B. This near-parity is significant because it shows that a model roughly one-fifteenth the size can generate code for competitive game-playing agents at a nearly identical level in this specific benchmark. The 27B version also outperformed every other model in the Qwen lineup, cementing its status as an exceptionally efficient option.
The GACL is a unique benchmark in which AI models don't play games directly but instead generate the code for autonomous agents that compete in seven different games, such as Battleship. Each model produces two agents, which then face off against all others. The league provides a practical test of a model's ability to reason, plan, and write functional code under constraints. While OpenAI's GPT-4o currently leads the overall rankings, the open-weight category is fiercely contested. Kimi2.5 from Moonshot AI currently holds the top open-weight spot at #6 globally, followed closely by GLM-5 at #7.
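To make the round-robin format concrete, here is a minimal sketch of how such a league could be scored. This is purely illustrative: the game (Rock-Paper-Scissors standing in for the generated agents' actual games), the agent strategies, and the scoring rules are all assumptions, not the real GACL setup.

```python
from itertools import combinations

# Illustrative stand-in game: Rock-Paper-Scissors. In the real league,
# each model would submit generated code implementing a game agent.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def play_match(agent_a, agent_b, rounds=10):
    """Return (score_a, score_b) over a fixed number of rounds."""
    score_a = score_b = 0
    for i in range(rounds):
        move_a, move_b = agent_a(i), agent_b(i)
        if BEATS[move_a] == move_b:
            score_a += 1
        elif BEATS[move_b] == move_a:
            score_b += 1
    return score_a, score_b

def run_league(agents):
    """Round-robin: every agent faces every other once; wins accumulate."""
    table = {name: 0 for name in agents}
    for (name_a, fn_a), (name_b, fn_b) in combinations(agents.items(), 2):
        score_a, score_b = play_match(fn_a, fn_b)
        table[name_a] += score_a
        table[name_b] += score_b
    return table

# Hypothetical agents, each a function from round index to move.
agents = {
    "always_rock": lambda i: "rock",
    "cycler": lambda i: ["rock", "paper", "scissors"][i % 3],
    "always_paper": lambda i: "paper",
}
standings = sorted(run_league(agents).items(), key=lambda kv: -kv[1])
```

In the actual league the pairing logic would span seven games and two agents per model, but the core idea is the same: aggregate head-to-head results into a single ranking.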
These results highlight a crucial trend in AI: the pursuit of efficiency without sacrificing capability. For developers and companies, a high-performing 27B model is far more accessible and cost-effective to run than a 397B behemoth, potentially enabling more sophisticated AI agent applications on consumer-grade hardware. The benchmark also exposed uneven game design: GPT models dominated in Battleship, while Tic-Tac-Toe failed to differentiate models at all and may be replaced, a sign that meaningful AI evaluation is still evolving.
- Qwen3.5-27B scored within 0.04 points of the 397B version, showing massive efficiency gains.
- The Game Agent Coding League tests models on generating code for agents that play seven different games.
- Kimi2.5 and GLM-5 are the top-ranked open-weight models globally at positions #6 and #7.
Why It Matters
It proves smaller, efficient models can match giants in specific tasks, making advanced AI agent development more accessible and affordable.