Agent Frameworks

AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models

AutoRISE evolves attack programs, not just prompts, raising average jailbreak success rate by 17 points over the strongest baseline.

Deep Dive

Automated red-teaming for large language models (LLMs) typically optimizes attack prompts within a fixed, human-designed strategy, leaving the strategy itself unchanged. A new paper introduces AutoRISE, a method that optimizes the strategy instead, searching over executable attack programs rather than individual prompts. At each iteration, a coding agent edits a strategy and a fixed evaluation harness scores the resulting attacks, returning both a scalar objective and per-example diagnostics that guide subsequent edits. This allows structural changes, including new attack components and altered control flow, that prompt-level methods cannot directly express.
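The edit-and-score loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `EvalResult`, `evaluate`, `evolve`, and `propose_edit` are hypothetical, the greedy keep-if-better acceptance rule is an assumption, and the "coding agent" and "judge" are stand-in callables rather than real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type: a strategy is an executable program that maps a test case to an attempt.
Strategy = Callable[[str], str]


@dataclass
class EvalResult:
    score: float             # scalar objective: fraction of cases judged successful
    diagnostics: List[bool]  # per-example outcomes, fed back to the editor


def evaluate(strategy: Strategy, cases: List[str],
             judge: Callable[[str], bool]) -> EvalResult:
    """Fixed harness: run the strategy on every case and score the outcomes."""
    outcomes = [judge(strategy(case)) for case in cases]
    return EvalResult(score=sum(outcomes) / len(outcomes), diagnostics=outcomes)


def evolve(initial: Strategy,
           propose_edit: Callable[[Strategy, EvalResult], Strategy],
           cases: List[str],
           judge: Callable[[str], bool],
           iterations: int) -> Tuple[Strategy, EvalResult]:
    """Iteratively edit the strategy, keeping an edit only if it scores higher.

    Greedy acceptance is an assumption made for this sketch.
    """
    best, best_result = initial, evaluate(initial, cases, judge)
    for _ in range(iterations):
        # Stand-in for the coding agent: it sees the current program and its diagnostics.
        candidate = propose_edit(best, best_result)
        result = evaluate(candidate, cases, judge)
        if result.score > best_result.score:
            best, best_result = candidate, result
    return best, best_result
```

Because a strategy here is an arbitrary callable rather than a prompt string, an edit can add a new step or change control flow, which is the kind of structural change the summary attributes to program-level search.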

AutoRISE was evaluated on 11 models from five families against seven established jailbreak datasets, using two benchmark suites developed on disjoint target sets. Across held-out models, it improved average attack success rate by 17.0 points over the strongest baseline, and by up to 16 points on frontier targets where baseline success rates are low. Ablations suggest these gains arise from unrestricted program search, particularly compositional techniques and control-flow edits. Notably, AutoRISE operates in a black-box, inference-only setting, requiring no fine-tuning, human annotation, or GPU compute, which lowers the barrier to entry for security researchers.
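The gain above is stated in points of attack success rate, which is an absolute difference and not the same as a relative percentage improvement. The short sketch below shows the distinction. The baseline and improved rates are hypothetical numbers chosen for illustration, not figures from the paper, and the function names are likewise illustrative.

```python
def point_gain(baseline_asr: float, new_asr: float) -> float:
    """Absolute gain in points, with both rates given as percentages."""
    return new_asr - baseline_asr


def relative_gain(baseline_asr: float, new_asr: float) -> float:
    """Relative gain as a percentage of the baseline rate."""
    return (new_asr - baseline_asr) * 100 / baseline_asr


# Hypothetical rates for illustration only.
baseline, improved = 40.0, 57.0
print(point_gain(baseline, improved))     # absolute difference in points
print(relative_gain(baseline, improved))  # relative improvement in percent
```

With these illustrative inputs the same change is 17 points but a larger relative improvement, and the gap widens as the baseline rate gets lower.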

Key Points
  • AutoRISE searches over executable attack programs, not just prompts, enabling structural strategy changes like new components and control flow edits.
  • It improves average attack success rate by 17.0 points over the strongest baseline on held-out models, in an evaluation spanning 11 models and 7 jailbreak datasets.
  • It operates in a black-box, inference-only setting with no need for fine-tuning, human annotation, or GPU compute.

Why It Matters

AutoRISE automates and improves LLM red-teaming, making security testing more effective and accessible for researchers.