When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models
Using DSPy prompt optimizers, researchers boosted a model's danger score from 0.09 to 0.79.
A new research paper from authors Zafir Shamsi, Nikhil Chekuru, Zachary Guzman, and Shivank Garg demonstrates a critical flaw in how AI safety is currently evaluated. The study, titled "When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models," reveals that static lists of harmful prompts fail to account for real-world attacks where adversaries iteratively refine their inputs. The researchers repurposed black-box prompt optimization techniques—originally designed to improve performance on benign tasks—to systematically search for safety failures in LLMs.
Using the DSPy framework, the team applied three optimizers to harmful prompts sourced from benchmarks like HarmfulQA and JailbreakBench. They explicitly optimized prompts toward a higher continuous "danger score" (0-1) as judged by an independent evaluator model, GPT-5.1. The results were stark: for the open-source model Qwen 3 8B, the average danger score skyrocketed from a baseline of 0.09 to 0.79 after optimization. This indicates that the model's safety safeguards were effectively circumvented through automated, adversarial refinement.
The findings suggest that current safety evaluations, which rely on fixed datasets, may be significantly underestimating the residual risk in deployed AI systems. The vulnerability was especially pronounced for smaller, open-source models. The paper argues that automated, adaptive red-teaming—where test inputs are dynamically optimized to find failures—must become a core component of robust safety evaluation pipelines to match the sophistication of potential real-world attackers.
- Researchers weaponized DSPy prompt optimizers to jailbreak LLMs, turning a safety tool into an attack vector.
- Qwen 3 8B's danger score jumped from 0.09 to 0.79, showing a near-total bypass of its safeguards.
- The study proves static safety benchmarks are inadequate against adaptive, iterative adversarial attacks.
Why It Matters
This exposes a fundamental weakness in AI safety testing, forcing a shift from static checklists to dynamic, adversarial evaluation.