Research & Papers

Quality-Diversity Evolution Finds Diverse LLM Vulnerabilities Across Models

New evolutionary framework reveals model-specific safety weaknesses in GPT-4o-mini, Claude, and Gemini.

Deep Dive

Researchers introduce a quality-diversity evolutionary framework that uses MAP-Elites to discover interpretable attack strategies across GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and Devstral-small-2. The method evolves attacks at the semantic level rather than as token sequences, and it reveals distinct vulnerability profiles. GPT-4o-mini is vulnerable to hypothetical and multi-turn framing combined with ROT13 encoding (fitness 0.8). Gemini is vulnerable to direct attacks with ROT13 and to multi-turn attacks with Leetspeak (fitness 0.8). Claude shows uniformly ambiguous responses across all strategies (maximum fitness 0.4). Because the evolved attacks are described by named strategies and encodings rather than opaque token strings, the results indicate which framings each model handles poorly.
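
To make the MAP-Elites idea concrete, here is a minimal sketch of the archive loop, not the authors' implementation. It assumes a grid with one cell per (strategy, encoding) pair and keeps only the best-scoring candidate in each cell, which is what preserves diversity. The descriptor lists, the `intensity` parameter, `mutate`, and `toy_fitness` are all hypothetical; in particular `toy_fitness` is an arbitrary stand-in for querying a model and grading its reply.

```python
import random

# Hypothetical behaviour descriptors; the names echo the summary above.
STRATEGIES = ["direct", "hypothetical", "multi_turn"]
ENCODINGS = ["none", "rot13", "leetspeak"]


def toy_fitness(strategy, encoding, intensity):
    """Arbitrary stand-in scorer in [0, 1]; a real run would grade a model reply."""
    base = 0.1 * (STRATEGIES.index(strategy) + 1) + 0.1 * ENCODINGS.index(encoding)
    return round(min(1.0, base + 0.4 * intensity), 3)


def mutate(candidate, rng):
    """Resample one descriptor or nudge the continuous parameter."""
    strategy, encoding, intensity = candidate
    choice = rng.randrange(3)
    if choice == 0:
        strategy = rng.choice(STRATEGIES)
    elif choice == 1:
        encoding = rng.choice(ENCODINGS)
    else:
        intensity = min(1.0, max(0.0, intensity + rng.uniform(-0.2, 0.2)))
    return (strategy, encoding, intensity)


def map_elites(iterations=500, seed=0):
    """Return an archive mapping each visited cell to its best (fitness, candidate)."""
    rng = random.Random(seed)
    archive = {}
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Usually: pick a random elite and mutate it.
            _, parent = rng.choice(list(archive.values()))
            child = mutate(parent, rng)
        else:
            # Occasionally (and always on the first step): sample from scratch.
            child = (rng.choice(STRATEGIES), rng.choice(ENCODINGS), rng.random())
        score = toy_fitness(*child)
        cell = (child[0], child[1])
        # Replace the cell's elite only if the newcomer scores higher.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive


if __name__ == "__main__":
    for cell, (score, _) in sorted(map_elites().items()):
        print(cell, score)
```

Reading the archive cell by cell gives a per-strategy profile instead of a single best attack, which is the property the summary credits for avoiding mode collapse.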

Key Points
  • MAP-Elites evolutionary framework discovers interpretable, diverse attack strategies at the semantic level, avoiding mode collapse.
  • GPT-4o-mini is vulnerable to hypothetical and multi-turn framing + ROT13; Gemini to direct attacks + ROT13 and multi-turn + Leetspeak; Claude gives only ambiguous responses (max fitness 0.4).
  • Method provides a reproducible baseline for model-specific safety evaluation, with code and experiments open-sourced.

Why It Matters

The framework reveals model-specific weaknesses, enabling targeted safety improvements for future LLMs and more robust red-teaming.