Using Game Theory to Make AI Safer by Managing Human Incentives
Researchers propose treating AI safety like a high-stakes game of cat and mouse.
A new paper argues that current AI safety methods are too static because they focus only on model tuning. It proposes using Stackelberg Security Games, a game theory framework in which a defender commits to a strategy first and an attacker responds after observing it, to model the strategic interaction between AI overseers and potential attackers. This approach aims to make oversight proactive, helping to allocate limited auditing resources, defend against data poisoning, and ensure robust deployment in adversarial environments.
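To make the auditing-allocation idea concrete, here is a minimal sketch of a Stackelberg Security Game. It is not the paper's model. It assumes a zero-sum setting in which an audited attack always fails, an unaudited attack on target i costs the overseer values[i], and the attacker sees the audit probabilities before choosing a target. The function name optimal_coverage, the parameter names, and the example numbers are all hypothetical.

```python
def optimal_coverage(values, resources, tol=1e-9):
    """Split a limited audit budget across targets (illustrative only).

    values[i]  : assumed positive damage if target i is attacked unaudited
    resources  : total audit probability the overseer can spread out
    Returns (coverage, attacker_payoff), where coverage[i] is the
    probability that target i is audited.
    """
    n = len(values)
    if resources >= n:
        # Enough budget to audit everything with certainty.
        return [1.0] * n, 0.0

    # The attacker's expected payoff on target i is (1 - coverage[i]) * values[i].
    # The overseer wants the smallest cap t on that payoff that the budget allows.
    # Capping target i at t requires coverage max(0, 1 - t / values[i]).
    lo, hi = 0.0, max(values)
    while hi - lo > tol:
        t = (lo + hi) / 2
        needed = sum(max(0.0, 1 - t / v) for v in values)
        if needed > resources:
            lo = t  # cap is too ambitious for the budget, so relax it
        else:
            hi = t  # cap is affordable, so try a tighter one
    t = hi
    coverage = [max(0.0, 1 - t / v) for v in values]
    return coverage, t


# Hypothetical example: three deployment surfaces, one unit of audit budget.
coverage, attacker_payoff = optimal_coverage([10.0, 5.0, 1.0], resources=1.0)
print([round(c, 3) for c in coverage], round(attacker_payoff, 3))
```

Under these assumptions the overseer audits the high-value targets just often enough that the attacker gains the same expected payoff from each of them, and spends nothing on targets that are not worth attacking even when left unaudited. Richer models would add imperfect detection and payoffs that are not zero-sum.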
Why It Matters
Modeling oversight as a strategic game could make AI systems more resilient to manipulation and failure, because it accounts for the real-world incentives of the people who oversee and attack them.