Show HN: A real-time strategy game that AI agents can play
Claude Opus 4.5 dominates with 85% win rate in real-time strategy game where AI agents write battle code.
A new benchmark called LLM Skirmish pits frontier large language models against each other in a real-time strategy (RTS) game in which they must write and execute JavaScript battle code. Inspired by the programmer-focused game Screeps, the benchmark was created to test LLMs' coding ability in a dynamic, competitive setting. In tournaments of five rounds, models such as Anthropic's Claude Opus 4.5, OpenAI's GPT-5.2, xAI's Grok 4.1 Fast, Zhipu's GLM 4.7, and Google's Gemini 3 Pro write strategies, see match results, and adapt their code between rounds. The orchestrator runs the open-source OpenCode agentic coding harness inside isolated Docker containers to keep play fair and results reproducible.
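The write, play, adapt cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`run_tournament`, `write_script`, `play_match`) are hypothetical, and the all-play-all pairing within each round is an assumption, since the summary does not describe the schedule.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Hypothetical signatures, not part of LLM Skirmish:
#   write_script(model, feedback_so_far) -> script source
#   play_match(script_a, script_b) -> 0 if the first script wins, 1 otherwise
WriteScript = Callable[[str, List[str]], str]
PlayMatch = Callable[[str, str], int]


def run_tournament(
    models: List[str],
    write_script: WriteScript,
    play_match: PlayMatch,
    rounds: int = 5,
) -> List[Tuple[int, str, str, str]]:
    """Run a multi-round tournament and return (round, a, b, winner) tuples."""
    feedback: Dict[str, List[str]] = {m: [] for m in models}
    results: List[Tuple[int, str, str, str]] = []
    for rnd in range(1, rounds + 1):
        # Each model writes a fresh script, seeing only earlier rounds' results.
        scripts = {m: write_script(m, list(feedback[m])) for m in models}
        # Assumption: every pair of models meets once per round.
        for a, b in combinations(models, 2):
            winner = a if play_match(scripts[a], scripts[b]) == 0 else b
            results.append((rnd, a, b, winner))
            feedback[a].append(f"round {rnd} vs {b}: {'win' if winner == a else 'loss'}")
            feedback[b].append(f"round {rnd} vs {a}: {'win' if winner == b else 'loss'}")
    return results
```

Under these assumptions, five models and five rounds produce 10 pairings per round and 50 matches per tournament. The point of the sketch is the data flow: results from round N reach a model only when it writes its script for round N+1.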
Claude Opus 4.5 emerged as the clear champion with an 85% win rate and an Elo rating of 1778, well ahead of GPT-5.2's 68% win rate. The benchmark specifically tests in-context learning, the ability to improve strategies based on the outcomes of previous rounds, with each model submitting 25 scripts across 250 total matches. Notably, Gemini 3 Pro's performance dropped in later rounds, suggesting difficulty with sustained strategic adaptation. LLM Skirmish marks a shift in AI evaluation from static coding tests to dynamic, multi-round competitions that better simulate real-world problem-solving, where agents must iteratively refine their approach based on feedback.
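For readers unfamiliar with Elo ratings, the standard update rule is sketched below. The summary does not say which Elo variant, starting rating, or K-factor LLM Skirmish uses, so the `k=32.0` default and the 400-point scale here are the conventional textbook values, not the benchmark's confirmed settings.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(
    rating_a: float,
    rating_b: float,
    score_a: float,
    k: float = 32.0,  # assumed K-factor; the benchmark's value is not stated
) -> tuple:
    """Return new (rating_a, rating_b) after one match.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a draw.
    """
    e_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - e_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated players: the winner gains k/2 points.
print(update_ratings(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

A rating gap translates directly into an expected win probability, which is why a rating and a raw win rate can rank models slightly differently: beating a strong opponent moves a rating more than beating a weak one.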
- Claude Opus 4.5 achieved an 85% win rate (1778 Elo) vs GPT-5.2's 68% in a head-to-head RTS coding competition
- Benchmark uses 5-round tournaments where LLMs adapt JavaScript strategies between rounds, testing in-context learning
- OpenCode agentic coding harness runs in isolated Docker containers, with models getting 3 attempts to fix script errors
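The three-attempt rule for script errors can be pictured as a bounded retry loop. The sketch below is illustrative only: `generate` and `validate` are hypothetical callables, and it assumes "3 attempts" means three submissions in total, which the summary does not spell out.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures, not part of OpenCode or LLM Skirmish:
#   generate(previous_error) -> script source (previous_error is None on the first try)
#   validate(script) -> None if the script is accepted, else an error message
Generate = Callable[[Optional[str]], str]
Validate = Callable[[str], Optional[str]]


def submit_with_retries(
    generate: Generate,
    validate: Validate,
    max_attempts: int = 3,  # assumed to count total submissions
) -> Tuple[Optional[str], int]:
    """Return (accepted_script, attempts_used), or (None, max_attempts) on failure."""
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        # The model sees the previous error message, if any, and tries again.
        script = generate(error)
        error = validate(script)
        if error is None:
            return script, attempt
    return None, max_attempts
```

Feeding the error message back into `generate` mirrors the benchmark's broader theme: the model is expected to repair its own code from feedback rather than get a single shot.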
Why It Matters
LLM Skirmish moves AI evaluation from static coding tests to dynamic, competitive environments that better simulate real-world iterative problem-solving.