New 'Age of LLM' benchmark reveals AI nuclear rush strategy dominates
LLMs playing a strategy game show 78% win rate via nuclear rush under fog of war
The Age of LLM benchmark pits two LLMs against each other in a turn-based strategy game on a 13x7 grid, where the goal is to destroy the enemy base. Three deliberate stressors set the benchmark apart: fog of war (limited visibility), full diplomacy (messages, ceasefires, and ultimatums, with uranium kept secret), and a strict reliability requirement under which each move must conform to a precise JSON schema and any illegal action is silently discarded. The private engine uses a fresh random map and opponent for each match to avoid data contamination. Researchers benchmarked 15 reasoning models across 54 matches, collecting 5,258 actions. Models received a minimal prompt with no build-order advice.
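The reliability requirement described above can be pictured with a minimal sketch. The benchmark's actual schema is not given here, so the action types, field names, and the `parse_action` and `apply_turn` helpers below are hypothetical; only the 13x7 grid and the silent-discard behavior come from the description, and treating 13 as the width is an assumption.

```python
import json

# Hypothetical action schema: the benchmark's real schema is not published in this summary.
ALLOWED_ACTIONS = {
    "move": {"unit_id": int, "x": int, "y": int},
    "build": {"kind": str, "x": int, "y": int},
    "message": {"text": str},
}
GRID_W, GRID_H = 13, 7  # grid size from the article; width/height order is assumed


def parse_action(raw):
    """Return a validated action dict, or None if the action must be discarded."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    action_type = data.get("type")
    if not isinstance(action_type, str) or action_type not in ALLOWED_ACTIONS:
        return None
    spec = ALLOWED_ACTIONS[action_type]
    # Require exactly the expected fields: no missing keys, no extras.
    if set(data) != set(spec) | {"type"}:
        return None
    for field, expected in spec.items():
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return None
    # Coordinates, when present, must fall inside the grid.
    if "x" in spec and not (0 <= data["x"] < GRID_W and 0 <= data["y"] < GRID_H):
        return None
    return data


def apply_turn(raw_actions):
    """Keep legal actions, silently drop the rest, and count the drops."""
    legal = [a for a in map(parse_action, raw_actions) if a is not None]
    return legal, len(raw_actions) - len(legal)
```

In this sketch the model gets no error message back: a malformed or out-of-bounds action simply does not happen, so a model that loses track of the board pays for it in wasted moves.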
The findings are revealing. The nuclear rush strategy dominates, with a 78% win rate in the rules-coherent sub-corpus. It relies on a single-launcher tactic that is largely a mechanical consequence of the secret, simultaneous launch rules rather than a cognitive failure. Military conquest is rarer but faster (12.3 turns vs. 18.9 turns for nuclear). Diplomatic messages are prolific but almost never lead to actual agreements. Notably, ~58% of illegal actions stem from fog-of-war or state errors, which makes the illegal action rate a proxy for belief-tracking ability. The most tentative finding is a weak link between reliability (fewer illegal actions) and winning; the corpus is small and unbalanced, so this result is exploratory. The released replays and viewer open a window into how LLMs reason under adversarial uncertainty, track beliefs, and spontaneously deceive, pointing to a new research direction for AI agent testing.
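As a rough illustration of how the illegal action rate could serve as a belief-tracking proxy, the sketch below computes the rate and the share of illegal actions attributed to fog-of-war or state errors. The log format, the `cause` labels, and the `illegal_action_stats` helper are assumptions made for illustration, and the sample log is toy data, not benchmark results.

```python
from collections import Counter


def illegal_action_stats(log):
    """Summarize a list of action records.

    Each record is assumed to look like {"legal": bool, "cause": str or None},
    where "cause" is a hypothetical label such as "fog", "state", or "schema".
    """
    total = len(log)
    illegal = [entry for entry in log if not entry["legal"]]
    causes = Counter(entry["cause"] for entry in illegal)
    # Fog-of-war and stale-state mistakes both reflect a wrong belief about the board.
    belief_errors = causes["fog"] + causes["state"]
    return {
        "illegal_rate": len(illegal) / total if total else 0.0,
        "belief_share": belief_errors / len(illegal) if illegal else 0.0,
    }


# Toy log: 10 actions, 4 of them illegal (2 fog, 1 state, 1 schema).
toy_log = (
    [{"legal": True, "cause": None}] * 6
    + [{"legal": False, "cause": "fog"}] * 2
    + [{"legal": False, "cause": "state"}]
    + [{"legal": False, "cause": "schema"}]
)
stats = illegal_action_stats(toy_log)
print(stats)  # {'illegal_rate': 0.4, 'belief_share': 0.75}
```

A per-model version of `belief_share` would separate models that produce malformed JSON from models that produce well-formed moves based on a wrong picture of the map.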
- Nuclear rush strategy yields a 78% win rate in the rules-coherent sub-corpus, dominating other approaches
- ~58% of illegal actions stem from fog-of-war or state errors, making the illegal action rate a proxy for belief-tracking
- Diplomacy is used heavily but almost never results in actual agreements
Why It Matters
For agent evaluation, this benchmark offers a fresh lens on AI reasoning, deception, and reliability under adversarial uncertainty.