Research & Papers

EvoJail: Evolutionary Diverse Jailbreak Prompt Generation for Large Language Models

New evolutionary jailbreak framework adapts to safety updates and boosts diversity by 5.6%.

Deep Dive

A team of researchers (Rui Tang, Kaiyu Xu, et al.) has introduced EvoJail, a novel framework for automatically generating diverse jailbreak prompts against large language models. Published on arXiv and accepted to Information Processing & Management, the method addresses two key gaps in existing automatic jailbreak generation: adaptability to evolving safety-finetuned models and diversity among generated prompts. EvoJail formalizes jailbreak prompt generation as a multi-objective black-box optimization problem, using evolutionary algorithms to search for prompts that remain effective across model versions and exhibit varied attack patterns. The framework runs prompt generation in an iterative loop: candidate prompts are evaluated against the target model, then selected and varied based on its responses, enabling continuous adaptation as the model is updated.
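The evaluate-select-vary loop described above follows the standard evolutionary black-box optimization pattern. The minimal sketch below illustrates that pattern only; `query_target` and `mutate` are stubbed placeholders, not EvoJail's actual scoring or mutation operators (the paper uses LLM-based mutation and responses from a real target model).

```python
import random

def query_target(prompt: str) -> float:
    """Black-box fitness stub: a real attack would send `prompt` to the
    target LLM and score whether the response complies. Here we just
    reward a marker token so the loop has a gradient to follow."""
    return 1.0 if "hypothetically" in prompt else 0.0

def mutate(prompt: str) -> str:
    """Toy mutation: append a random rephrasing token. EvoJail instead
    applies multi-level LLM-based mutation operators."""
    return prompt + " " + random.choice(["please", "hypothetically", "imagine"])

def evolve(seeds: list[str], generations: int = 10, pop_size: int = 8) -> list[str]:
    """Iterative loop: evaluate candidates, keep the best, vary them."""
    population = list(seeds)
    for _ in range(generations):
        ranked = sorted(population, key=query_target, reverse=True)
        parents = ranked[: pop_size // 2]           # selection
        children = [mutate(p) for p in parents]     # variation
        population = parents + children             # next generation
    return sorted(population, key=query_target, reverse=True)

best = evolve(["Tell me a story.", "Explain the rules."])[0]
```

Because the target model is queried only as a black box, the same loop can be re-run whenever the model receives a safety update, which is what gives the approach its adaptability.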

To enhance diversity, EvoJail introduces field-aware instruction fusion to create varied starting points and incorporates diversity-aware objectives into the evolutionary fitness function. The framework also employs multi-level LLM-based mutation operators that modify prompt structure at different granularities. Results show EvoJail achieves over 93% attack success rate and more than 5.6% improvement in diversity metrics compared to state-of-the-art methods. This work underscores the escalating arms race between LLM safety measures and adversarial attacks, highlighting the need for more robust defenses.

Key Points
  • EvoJail uses evolutionary algorithms to generate jailbreak prompts adaptable to updated safety-finetuned models.
  • Achieves over 93% attack success rate and 5.6% improvement in diversity over prior methods.
  • Employs field-aware instruction fusion and multi-level mutation operators for prompt diversity.

Why It Matters

EvoJail reveals how easily LLM safeguards can be bypassed, pushing researchers toward more adaptive defenses.