Research & Papers

A Theoretical Game of Attacks via Compositional Skills

Researchers formalize adversarial prompting as a game, proving attackers have structural advantages.

Deep Dive

Authors Xinbo Wu, Huan Zhang, Abhishek Umrawal, and Lav R. Varshney introduce a theoretical framework that models attacks and defenses on LLMs as a formal game. They derive a best-response attack strategy, closely related to many existing adversarial prompting methods, and characterize the game's equilibria, showing that attackers hold an inherent structural advantage. On the defense side, they derive a provably optimal strategy. Empirically, a practical instantiation of their attack outperforms existing methods across multiple LLMs and benchmarks.
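To make the game-theoretic framing concrete, here is a toy sketch of an attacker-defender game solved by repeated best responses (fictitious play). The payoff matrix, strategy labels, and solver are illustrative assumptions for intuition only, not the paper's actual formulation: rows are hypothetical attack strategies, columns are hypothetical defenses, and entries are attack success rates.

```python
import numpy as np

# Hypothetical payoff matrix (NOT from the paper): rows = attacker
# strategies, cols = defender strategies. Each entry is the attack
# success probability; the attacker maximizes it, the defender minimizes.
payoff = np.array([
    [0.9, 0.3, 0.5],
    [0.4, 0.8, 0.6],
    [0.6, 0.5, 0.7],
])

def fictitious_play(A, iters=5000):
    """Approximate the zero-sum equilibrium via repeated best responses."""
    m, n = A.shape
    row_counts = np.zeros(m)  # how often attacker played each row
    col_counts = np.zeros(n)  # how often defender played each column
    row_counts[0] = col_counts[0] = 1
    for _ in range(iters):
        # Attacker best-responds to the defender's empirical mixture.
        q = col_counts / col_counts.sum()
        row_counts[np.argmax(A @ q)] += 1
        # Defender best-responds to the attacker's empirical mixture.
        p = row_counts / row_counts.sum()
        col_counts[np.argmin(p @ A)] += 1
    p = row_counts / row_counts.sum()
    q = col_counts / col_counts.sum()
    return p, q, p @ A @ q

p, q, value = fictitious_play(payoff)
print("attacker mixture:", p.round(3))
print("defender mixture:", q.round(3))
print("game value (attack success rate):", round(value, 3))
```

In this toy game the equilibrium value stays well above zero: even against the defender's optimal mixed strategy, the attacker succeeds a substantial fraction of the time. That is the flavor of the paper's structural result, which establishes the attacker's advantage formally rather than for one example matrix.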

Key Points
  • Formalizes LLM attacks as a compositional skills game, revealing an inherent attacker advantage at equilibrium.
  • Proposes a provably optimal defense strategy alongside the best-response attack.
  • Empirical results show the theoretical attack outperforms existing adversarial methods across multiple LLMs and benchmarks.

Why It Matters

Demonstrates fundamental limits of current LLM alignment, urging stronger defenses grounded in game theory.