AgenticRed: Evolving Agentic Systems for Red-Teaming
A new evolutionary algorithm autonomously designs attack systems that achieve perfect attack success rates against top AI models.
A research team from UC Berkeley and Google DeepMind has published a groundbreaking paper on arXiv titled 'AgenticRed: Evolving Agentic Systems for Red-Teaming.' The work addresses a critical bottleneck in AI safety: the reliance on human-specified workflows for automated red-teaming, which introduces bias and is expensive to scale. Instead, AgenticRed treats red-teaming as a system design problem. It leverages large language models' in-context learning capabilities within an evolutionary algorithm framework to autonomously generate, test, and refine attack strategies. This creates a self-improving pipeline that explores a vast design space without manual input.
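To make that pipeline concrete, the sketch below shows one way such a loop could look. It is a minimal illustration of a generic LLM-driven evolutionary search, not the paper's actual method: the names (`AttackSystem`, `propose_variant`, `attack_success_rate`) and the callables `llm_complete` and `run_attack` are hypothetical stand-ins.

```python
import random
from dataclasses import dataclass

# Minimal sketch of an LLM-driven evolutionary loop for red-teaming system
# design. All names here are illustrative assumptions, not the paper's API.

@dataclass
class AttackSystem:
    design: str           # natural-language / code description of the system
    fitness: float = 0.0  # measured attack success rate (ASR)

def propose_variant(llm_complete, parent: AttackSystem) -> AttackSystem:
    """Use an LLM's in-context learning as the mutation operator."""
    prompt = (
        "Improve this automated red-teaming system design:\n"
        f"{parent.design}\n"
        "Return only the revised design."
    )
    return AttackSystem(design=llm_complete(prompt))

def attack_success_rate(system: AttackSystem, run_attack, behaviors) -> float:
    """Fitness: fraction of harmful behaviors the system elicits."""
    wins = sum(bool(run_attack(system.design, b)) for b in behaviors)
    return wins / len(behaviors)

def evolve(llm_complete, run_attack, behaviors, seed_designs,
           generations=20, pop_size=8) -> AttackSystem:
    # Score the initial population of seed designs.
    pop = [AttackSystem(d) for d in seed_designs]
    for s in pop:
        s.fitness = attack_success_rate(s, run_attack, behaviors)
    for _ in range(generations):
        # Tournament-select a strong parent, mutate it via the LLM,
        # score the child, and keep only the fittest designs.
        parent = max(random.sample(pop, k=min(3, len(pop))),
                     key=lambda s: s.fitness)
        child = propose_variant(llm_complete, parent)
        child.fitness = attack_success_rate(child, run_attack, behaviors)
        pop = sorted(pop + [child], key=lambda s: s.fitness, reverse=True)[:pop_size]
    return pop[0]  # best evolved red-teaming system
```

Given any `llm_complete(prompt) -> str` and a `run_attack(design, behavior) -> bool` harness, the loop needs no human in the mutation or selection steps, which is the core of the design-automation claim.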
The results are stark. Systems evolved by AgenticRed consistently outperformed state-of-the-art human-designed methods. On the HarmBench safety evaluation, AgenticRed achieved a 96% attack success rate (ASR) on Llama-2-7B, 98% on Llama-3-8B, and a perfect 100% on Qwen3-8B. Most impressively, the query-agnostic systems it generated transferred strongly to the latest frontier models, achieving a 100% ASR on GPT-5.1, DeepSeek-R1, and DeepSeek-V3.2. This suggests the attacks exploit fundamental, model-agnostic vulnerabilities rather than quirks of specific architectures.
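For context, ASR is the standard red-teaming metric: the fraction of evaluated harmful behaviors for which the attack elicits a successful jailbreak, as determined by a judge. A one-liner makes the arithmetic explicit (the function name and inputs are illustrative):

```python
def asr(judge_verdicts: list[bool]) -> float:
    """Attack success rate: successful attacks divided by behaviors evaluated.
    judge_verdicts[i] is True when a judge labels the target's response
    to harmful behavior i as a successful attack."""
    return sum(judge_verdicts) / len(judge_verdicts)

# e.g., succeeding on 96 of 100 evaluated behaviors gives 96% ASR
print(asr([True] * 96 + [False] * 4))  # 0.96
```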
This research represents a paradigm shift. By automating the *design of the testing system itself*, AgenticRed provides a safety-evaluation method that can, in principle, keep pace with the rapid evolution of AI models. It highlights evolutionary algorithms as a powerful tool for proactive AI safety, moving beyond static test suites to dynamic, adaptive threat discovery. Achieving perfect attack success rates on top-tier models underscores both the sophistication of the approach and the persistent vulnerabilities in even the most advanced AI systems.
- Autonomously evolves red-teaming systems using LLMs and evolutionary algorithms, reducing human bias in the design process.
- Achieved a 100% attack success rate on frontier models, including OpenAI's GPT-5.1 and DeepSeek's latest releases.
- Generates robust, query-agnostic attack systems that transfer effectively across targets, demonstrating a scalable method for safety testing (see the transfer-check sketch below).
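Because the evolved systems are query-agnostic, a transfer check amounts to freezing one design and pointing it at new targets unchanged. A minimal sketch, assuming a hypothetical `run_attack_on(model, design, behavior)` harness callable that returns a judge verdict:

```python
def transfer_eval(design: str, run_attack_on, behaviors, target_models) -> dict:
    """Report per-model ASR for one frozen evolved design, with no
    re-evolution or per-target tuning."""
    return {
        model: sum(bool(run_attack_on(model, design, b)) for b in behaviors)
               / len(behaviors)
        for model in target_models
    }

# Hypothetical usage, given a harness and a behavior set:
# report = transfer_eval(best.design, harness, behaviors,
#                        ["target-model-a", "target-model-b"])
```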
Why It Matters
Provides a scalable, automated framework for finding critical AI safety vulnerabilities before malicious actors do, essential for securing next-gen models.