AI Safety

Hidden Role Games as a Trusted Model Eval

New evaluation method uses Mafia-style games to measure AI's ability to scheme and deceive.

Deep Dive

AI safety researcher James Lucassen has proposed a novel evaluation framework using hidden role games like Mafia to test AI models for dangerous capabilities. The method specifically measures FASUU (Flexible Adversarial Strategy Under Uncertainty) - a critical capability where models must navigate deception, uncertainty, and adversarial reasoning simultaneously. This approach emerged from observing that current advanced models like GPT-4 and Claude perform poorly at Machiavellian gameplay despite excelling at technical tasks like coding and research automation.

Hidden role games create scenarios where AI agents must deduce hidden allegiances, manipulate others, and maintain deception under pressure - mirroring the complex reasoning required for dangerous scheming behavior. The evaluation serves as a complement to existing safety benchmarks like SSE (Subsystem Safety Evaluation), providing a more realistic test of how AI might behave in adversarial situations. Lucassen's research suggests this capability gap could be crucial for determining when AI systems become untrustworthy and potentially dangerous as they approach automated R&D capabilities.

Key Points
  • Proposes hidden role games like Mafia as benchmark for FASUU (Flexible Adversarial Strategy Under Uncertainty)
  • Current models show poor deception/strategy skills despite strong coding/research capabilities
  • Method complements existing safety evals by testing realistic adversarial scenarios

Why It Matters

Provides early warning system for dangerous AI capabilities before models can automate harmful research.