Quality-Diversity Evolution Finds Diverse LLM Vulnerabilities Across Models
New evolutionary framework reveals model-specific safety weaknesses in GPT-4o-mini, Claude, and Gemini.
Researchers introduce a quality-diversity evolutionary framework that uses MAP-Elites to discover interpretable attack strategies against GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and Devstral-small-2. The method evolves semantic attacks rather than token sequences, and it reveals distinct vulnerability profiles. GPT-4o-mini is vulnerable to hypothetical and multi-turn framing combined with ROT13 encoding (fitness 0.8). Gemini is vulnerable to direct attacks with ROT13 and to multi-turn attacks with Leetspeak (fitness 0.8). Claude gives uniformly ambiguous responses across all strategies (max fitness 0.4). These per-model profiles give developers concrete targets for safety work.
- The MAP-Elites evolutionary framework discovers interpretable, diverse attack strategies at the semantic level, avoiding mode collapse.
- GPT-4o-mini is vulnerable to hypothetical and multi-turn framing with ROT13 (0.8); Gemini to direct attacks with ROT13 and multi-turn attacks with Leetspeak (0.8); Claude is the most robust (max fitness 0.4).
- Method provides a reproducible baseline for model-specific safety evaluation, with code and experiments open-sourced.
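To make the archive idea concrete, here is a minimal MAP-Elites sketch in Python. It is not the authors' implementation. The descriptor grid (framing × encoding), the `Candidate` fields, the `intensity` gene, and `toy_fitness` are all assumptions made for illustration; a real run would replace `toy_fitness` with a call to a target model plus a judge, which this sketch deliberately leaves out. The `encode` helper only shows what the two encodings named above do to a harmless string.

```python
import codecs
import random
from dataclasses import dataclass, replace

# Hypothetical behaviour descriptors: one archive cell per (framing, encoding).
FRAMINGS = ["direct", "hypothetical", "multi_turn"]
ENCODINGS = ["plain", "rot13", "leetspeak"]
LEET = str.maketrans("aeiost", "431057")


def encode(text: str, encoding: str) -> str:
    """Apply one of the surface encodings to a string."""
    if encoding == "rot13":
        return codecs.encode(text, "rot_13")
    if encoding == "leetspeak":
        return text.lower().translate(LEET)
    return text


@dataclass(frozen=True)
class Candidate:
    framing: str
    encoding: str
    intensity: float  # illustrative extra gene in [0, 1], not from the source


def toy_fitness(c: Candidate) -> float:
    """Placeholder score in [0, 1]; stands in for a model query plus judge."""
    target = (FRAMINGS.index(c.framing) + ENCODINGS.index(c.encoding)) / 4
    return 1.0 - abs(c.intensity - target)


def random_candidate(rng: random.Random) -> Candidate:
    return Candidate(rng.choice(FRAMINGS), rng.choice(ENCODINGS), rng.random())


def mutate(c: Candidate, rng: random.Random) -> Candidate:
    """Change exactly one gene of the parent."""
    gene = rng.choice(["framing", "encoding", "intensity"])
    if gene == "framing":
        return replace(c, framing=rng.choice(FRAMINGS))
    if gene == "encoding":
        return replace(c, encoding=rng.choice(ENCODINGS))
    nudged = c.intensity + rng.gauss(0.0, 0.1)
    return replace(c, intensity=min(1.0, max(0.0, nudged)))


def map_elites(iterations: int = 500, seed: int = 0) -> dict:
    """Keep the best candidate found so far in each descriptor cell."""
    rng = random.Random(seed)
    archive = {}  # (framing, encoding) -> (fitness, Candidate)
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            _, parent = rng.choice(list(archive.values()))
            child = mutate(parent, rng)
        else:
            child = random_candidate(rng)
        cell = (child.framing, child.encoding)
        score = toy_fitness(child)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive


if __name__ == "__main__":
    for cell, (score, _) in sorted(map_elites().items()):
        print(cell, round(score, 2))
```

Because each cell keeps its own elite, a single high-scoring strategy cannot crowd out the rest of the grid, which is the property the summary refers to as avoiding mode collapse. The resulting archive reads directly as a per-cell table, similar in shape to the per-model profiles reported above.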
Why It Matters
Reveals model-specific weaknesses, enabling targeted safety improvements for future LLMs and more robust red-teaming.