Research & Papers

Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games

GPT-5.5 crushes rivals in a contamination-resistant AI-vs-AI island game.

Deep Dive

A new benchmark called Agent Island aims to solve two persistent problems in AI evaluation: benchmark saturation and data contamination. Instead of static tests or fixed tasks, Agent Island is a multiplayer simulation environment in which language model agents play a game of cooperation, conflict, and persuasion. The design resists both problems by construction: because the format is winner-take-all, a new model can always outperform the current leader, so scores never saturate; and because agents adapt to one another rather than facing a fixed dataset, there is no answer key that can leak into training data. Rankings are computed with a Bayesian Plackett-Luce model, which yields skill estimates with quantified uncertainty.
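
The Plackett-Luce model treats each game's finish order as a sequence of softmax choices over the players still in the running. Below is a minimal sketch of its log-likelihood in Python; the function name is invented, the input values simply reuse the article's top-three posterior means for illustration, and the paper's full Bayesian treatment (a prior over skills plus posterior inference) is not reproduced here.

```python
import numpy as np

def plackett_luce_loglik(skills, finish_order):
    """Log-likelihood of one game's finish order under Plackett-Luce.

    skills:       1-D array of latent log-strengths, one per player
    finish_order: player indices ordered from first place to last place
    """
    s = np.asarray(skills)[np.asarray(finish_order)]
    loglik = 0.0
    for i in range(len(s) - 1):  # the last remaining player contributes log(1) = 0
        # P(next finisher = player i among those still in the game)
        #   = exp(s_i) / sum_{j >= i} exp(s_j)
        loglik += s[i] - np.logaddexp.reduce(s[i:])
    return loglik

skills = np.array([5.64, 3.10, 2.86])            # illustrative skill values
print(plackett_luce_loglik(skills, [0, 1, 2]))   # strongest-first order: ~ -0.71
print(plackett_luce_loglik(skills, [2, 1, 0]))   # reversed order: ~ -5.53, far less likely
```

A Bayesian fit places a prior on the skill vector and samples its posterior given all observed finish orders, which is where the posterior means and the uncertainty estimates mentioned above come from.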

Results from 999 games involving 49 unique models reveal a clear hierarchy. OpenAI's GPT-5.5 dominates with a posterior mean skill of 5.64, nearly double that of the second-ranked GPT-5.2 at 3.10; GPT-5.3-codex is third at 2.86. The released game logs also expose an intriguing bias: models are 8.3 percentage points more likely to vote for a finalist from the same provider. The preference is not uniform, with OpenAI models showing the strongest same-provider bias and Anthropic models the weakest. The work demonstrates the potential of adversarial, multi-agent environments for robust, future-proof model evaluation.
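
The same-provider bias can be read directly off the released vote logs. The sketch below shows one way such a number could be computed, assuming a hypothetical two-finalist log schema; every column name and row here is invented for illustration, and the 8.3-point figure comes from the actual logs, not from this toy data.

```python
import pandas as pd

# Toy vote log: one row per juror vote in a two-finalist endgame (assumed schema).
votes = pd.DataFrame({
    "voter_provider":      ["openai", "openai", "anthropic", "google"],
    "finalist_a_provider": ["openai", "google", "anthropic", "openai"],
    "finalist_b_provider": ["google", "openai", "openai",    "google"],
    "voted_for":           ["a",      "a",      "b",         "b"],
})

a_same = votes["finalist_a_provider"] == votes["voter_provider"]
b_same = votes["finalist_b_provider"] == votes["voter_provider"]

# Keep votes where exactly one finalist shares the juror's provider,
# so "voting for your own provider" is unambiguous.
eligible = votes[a_same ^ b_same].copy()
eligible["same_finalist"] = a_same[a_same ^ b_same].map({True: "a", False: "b"})

rate = (eligible["voted_for"] == eligible["same_finalist"]).mean()
# With two finalists, an unbiased juror picks either side 50% of the time;
# the reported bias is the gap between the observed rate and that baseline.
print(f"same-provider vote rate: {rate:.1%}  (bias: {rate - 0.5:+.1%})")
```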

Key Points
  • Agent Island uses a multiplayer game format to create a dynamic benchmark resistant to saturation and contamination.
  • GPT-5.5 achieves a skill score of 5.64, far ahead of GPT-5.2 (3.10) and GPT-5.3-codex (2.86) across 999 games and 49 models.
  • Models exhibit an 8.3 percentage-point same-provider voting bias, most pronounced in OpenAI models and weakest in Anthropic models.

Why It Matters

Agent Island offers a scalable, contamination-resistant way to measure AI progress by pitting models against each other.