New AI geopolitical benchmark: LLMs escalate conflicts, DeepSeek V3.2 most peaceful
8 frontier LLMs tested on coercion within NATO: only 3 ever achieve peaceful acquisition, DeepSeek V3.2 most notably.
A new arXiv paper titled 'Strategic Coercion Within Alliances: The Greenland Sovereignty Game as an AI Stress Test' puts frontier large language models through a geopolitical wargame. Researchers Rommin Adl and Peyton Williams designed three nested games—asymmetric coercion, a NATO assurance game with a critical-mass tipping point, and a triadic extensive-form game with social preferences—simulating the 2019-2026 US push to acquire Greenland from Denmark. They deployed eight frontier LLMs (including GPT-4o, Claude 3.5, Gemini, and DeepSeek V3.2) in six geopolitical roles (US, Denmark, Greenland, NATO, Russia, Canada), logging 3,604 complete games and 108,120 action observations. Using inverse game theory, they recovered each model's structural utility parameters for material self-interest, reciprocity, inequality aversion, norm respect, and commitment consistency.
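To make the "inverse game theory" step more concrete, here is a minimal sketch of how preference weights could in principle be scored against observed choices. It assumes a linear utility over the five components named above and a softmax (logit) choice rule. Both are common modeling choices, not something this summary confirms the authors used. The action names, feature scores, and weight values are all hypothetical and chosen only for illustration.

```python
import math

# The five preference components named in the paper summary.
COMPONENTS = ("material", "reciprocity", "inequality_aversion",
              "norm_respect", "commitment")


def utility(weights, features):
    """Linear utility: weighted sum of an action's component scores."""
    return sum(weights[c] * features[c] for c in COMPONENTS)


def choice_probabilities(weights, actions):
    """Softmax (logit) choice rule over a dict of action -> feature scores."""
    utils = {a: utility(weights, f) for a, f in actions.items()}
    top = max(utils.values())  # subtract the max for numerical stability
    exps = {a: math.exp(u - top) for a, u in utils.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def log_likelihood(weights, observations):
    """Log-likelihood of observed choices; observations = [(actions, chosen)]."""
    return sum(math.log(choice_probabilities(weights, actions)[chosen])
               for actions, chosen in observations)


def weights_with(**overrides):
    """Helper: all-zero weights except the components given as keywords."""
    w = {c: 0.0 for c in COMPONENTS}
    w.update(overrides)
    return w


# Hypothetical feature scores for two stylized actions (illustrative only).
actions = {
    "negotiate": {"material": 0.4, "reciprocity": 0.5,
                  "inequality_aversion": 0.2, "norm_respect": 1.0,
                  "commitment": 0.5},
    "coerce": {"material": 1.0, "reciprocity": -0.5,
               "inequality_aversion": -0.3, "norm_respect": -1.0,
               "commitment": 0.0},
}

# Hypothetical log: a model that chose "negotiate" three times in a row.
observations = [(actions, "negotiate")] * 3

# Two candidate weight profiles to compare against the observed choices.
material_heavy = weights_with(material=2.0)
norm_heavy = weights_with(norm_respect=2.0)

print("material-heavy log-likelihood:", log_likelihood(material_heavy, observations))
print("norm-heavy log-likelihood:", log_likelihood(norm_heavy, observations))
```

In this toy setup, the norm-heavy profile explains the observed "negotiate" choices better than the material-heavy one, so it receives the higher log-likelihood. A full estimation would search over weight vectors to maximize that quantity across all logged games, rather than comparing two hand-picked candidates.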
Three key findings emerge. First, under coercion framing, all eight models became more escalatory, with four-action escalation rising from 10.7% to 28.6%. Second, Chinese-origin models exhibited systematically different power-weight profiles from Western-origin models when playing the US role. Third, peaceful US acquisition (without coercion) occurred in only 1.9% of clean games, and only 3 of 8 frontier models ever achieved it—most notably DeepSeek V3.2, which executed a stable five-round playbook through the metropole. Prompts emphasizing jus cogens and self-determination reduced escalation back near baseline in the English-only confirmatory sample. The paper positions itself as a structural benchmark for LLM geopolitical behavior, complementing existing action-frequency benchmarks.
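As a quick sanity check on the reported figures, the arithmetic below works out the average number of action observations per game and the size of the escalation shift. It uses only numbers quoted in this article; the variable names are illustrative.

```python
# Figures quoted in the article.
games = 3604
action_observations = 108120
baseline_pct = 10.7   # four-action escalation rate at baseline
coercion_pct = 28.6   # four-action escalation rate under coercion framing

# Average action observations logged per complete game.
obs_per_game = action_observations / games

# Absolute shift in percentage points and the relative multiple.
shift_pp = round(coercion_pct - baseline_pct, 1)
multiple = round(coercion_pct / baseline_pct, 2)

print(f"observations per game: {obs_per_game:.0f}")
print(f"escalation shift: {shift_pp} percentage points ({multiple}x baseline)")
```

The per-game average comes out to a whole number, which would be consistent with a fixed number of recorded actions per game, although the summary does not state how each game is structured. The escalation rate under coercion framing is roughly 2.7 times the baseline rate.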
- All eight frontier LLMs became more escalatory under coercion framing in the Greenland sovereignty simulation, with four-action escalation rising from 10.7% to 28.6%.
- Chinese-origin models showed different power-weight profiles than Western models when representing the US role.
- Peaceful US acquisition occurred in only 1.9% of clean games, and only 3 of 8 models (most notably DeepSeek V3.2) ever achieved it.
Why It Matters
The benchmark reveals how LLMs handle high-stakes geopolitical coercion, with implications for AI safety in allied decision-making.