Research & Papers

GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

Your AI agents might be failing critical coordination tests right now.

Deep Dive

Researchers introduced GT-HarmBench, a new benchmark that tests AI safety in multi-agent scenarios through the lens of game theory. It contains 2,009 high-stakes situations, such as Prisoner's Dilemmas, drawn from real AI risk contexts. Across 15 frontier models, models chose the socially beneficial action only 62% of the time, frequently producing harmful outcomes. The study also shows that game-theoretic interventions can improve outcomes by up to 18%, highlighting major reliability gaps in multi-agent environments.
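To make the setup concrete, here is a minimal sketch (not GT-HarmBench's actual code) of a canonical Prisoner's Dilemma payoff matrix and the kind of "socially beneficial action rate" metric the benchmark reports; the payoff values and sample choices are illustrative assumptions.

```python
# Illustrative sketch, not the benchmark's implementation.
# Payoffs (row player, column player) for actions C=cooperate, D=defect.
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation: best total welfare
    ("C", "D"): (0, 5),  # sucker's payoff vs. temptation to defect
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),  # mutual defection: equilibrium, poor welfare
}

def socially_beneficial_action() -> str:
    """Action that maximizes joint payoff when both players play it."""
    return max("CD", key=lambda a: sum(PAYOFFS[(a, a)]))

def social_rate(choices: list[str]) -> float:
    """Fraction of choices matching the socially beneficial action."""
    best = socially_beneficial_action()
    return sum(c == best for c in choices) / len(choices)

# Hypothetical model decisions across five scenarios.
print(socially_beneficial_action())           # C
print(social_rate(["C", "D", "C", "C", "D"]))  # 0.6
```

In the Prisoner's Dilemma, defection is individually rational but mutually cooperating yields the highest joint payoff; a 62% rate means models pick the cooperative, welfare-maximizing action in just under two-thirds of scenarios.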

Why It Matters

As AI systems increasingly interact with one another, these results expose a critical blind spot in current safety testing, one that could lead to real-world coordination failures.