OrchJail framework exploits multi-step tool chains to jailbreak T2I agents
Using learned orchestration patterns, it combines individually benign steps into unsafe outputs.
Tool-calling text-to-image (T2I) agents can autonomously plan and execute multi-step tool chains (e.g., generating, editing, compositing) to handle complex user requests. This capability, however, creates a dangerous new attack surface: harmful outputs can arise from the orchestration of individually benign tool calls. Traditional prompt-based jailbreaks fail here because the danger lies in the sequence, not the prompt text. OrchJail, introduced by a team led by Jianming Chen, is a fuzzing framework specifically designed to exploit these tool-orchestration vulnerabilities. Its core innovation is learning from successful jailbreak traces and modeling causal relationships between prompt wording and unsafe tool behaviors, then using those insights to guide the fuzzing search. This avoids relying on surface-level textual perturbations and directly targets the orchestration patterns most likely to produce unsafe results.
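The idea of letting past successes steer a fuzzing search can be illustrated with a deliberately abstract toy. The sketch below is not OrchJail's algorithm: it swaps the paper's causal modeling for simple frequency weighting, and the mutation names, the `learn_weights` and `guided_fuzz` helpers, and the `oracle` callback are all hypothetical placeholders invented for illustration.

```python
import random
from collections import Counter

# Hypothetical mutation operators; placeholder names, not taken from the paper.
MUTATIONS = ["rephrase", "split_steps", "reorder_steps", "add_edit_step"]


def learn_weights(successful_traces, smoothing=1.0):
    """Weight each mutation by how often it appears in past successful traces.

    Assumption: plain smoothed frequency counts stand in for the causal
    modeling described in the summary.
    """
    counts = Counter(m for trace in successful_traces for m in trace)
    total = sum(counts[m] + smoothing for m in MUTATIONS)
    return {m: (counts[m] + smoothing) / total for m in MUTATIONS}


def guided_fuzz(oracle, weights, budget, rng):
    """Sample two-step mutation sequences in proportion to their weights.

    Returns (candidate, queries_used) for the first candidate the oracle
    flags, or (None, budget) if the query budget runs out.
    """
    names = list(weights)
    probs = [weights[n] for n in names]
    for query in range(1, budget + 1):
        candidate = rng.choices(names, weights=probs, k=2)
        if oracle(candidate):
            return candidate, query
    return None, budget


if __name__ == "__main__":
    traces = [["split_steps", "add_edit_step"], ["split_steps", "reorder_steps"]]
    weights = learn_weights(traces)

    # Toy oracle: a stand-in for "this sequence triggered the target behavior".
    def toy_oracle(candidate):
        return candidate == ["split_steps", "add_edit_step"]

    found, used = guided_fuzz(toy_oracle, weights, budget=200, rng=random.Random(0))
    print(weights, found, used)
```

Because operators seen in successful traces receive more probability mass, the search spends its query budget on sequences resembling what worked before, which is the intuition behind needing fewer queries than undirected textual perturbation.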
Experiments across multiple representative tool-calling T2I agents show that OrchJail significantly outperforms existing attacks: it achieves higher jailbreak success rates, produces higher-fidelity images (i.e., the attack degrades image quality less), and uses fewer queries per attack. Crucially, it remains effective even against common defense mechanisms, indicating that current safeguards are ill-equipped to handle orchestration-level threats. The paper, submitted to ICML 2026, highlights tool orchestration as a critical, previously unexplored attack surface and provides a systematic framework for uncovering safety risks in increasingly autonomous multimodal AI agents. This work underscores the need for new defensive strategies that consider the entire tool chain, not just individual steps.
- OrchJail uses orchestration-guided fuzzing, learning from causal relationships between prompt wording and unsafe multi-step tool behaviors.
- Achieves higher attack success rates and better image fidelity than prompt-only jailbreak methods across representative tool-calling T2I agents.
- Robust against common jailbreak defenses, revealing a critical gap in current safety measures for multi-step AI agents.
Why It Matters
As AI agents gain tool-calling autonomy, orchestration-level vulnerabilities demand new defenses beyond prompt filtering.
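To make the gap between per-step and chain-level checking concrete, here is a minimal sketch. It is an illustration under stated assumptions, not a defense proposed in the paper: the `ToolCall` type, the content tags, and the `BLOCKED_COMBINATIONS` policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str        # e.g. "generate", "edit", "composite"
    tags: frozenset  # hypothetical content tags attached to this step's output


# Hypothetical policy: tag sets that are disallowed only when they co-occur.
BLOCKED_COMBINATIONS = [frozenset({"tag_a", "tag_b"})]


def step_is_allowed(call):
    """Per-step filter: inspects one tool call in isolation."""
    return not any(combo <= call.tags for combo in BLOCKED_COMBINATIONS)


def chain_is_allowed(chain):
    """Chain-level filter: inspects the tags accumulated across all steps."""
    accumulated = frozenset().union(*(call.tags for call in chain))
    return not any(combo <= accumulated for combo in BLOCKED_COMBINATIONS)


if __name__ == "__main__":
    chain = [
        ToolCall("generate", frozenset({"tag_a"})),
        ToolCall("edit", frozenset({"tag_b"})),
    ]
    print(all(step_is_allowed(c) for c in chain))  # every step passes alone
    print(chain_is_allowed(chain))                 # the combined chain does not
```

In this toy setup each call passes the per-step filter, yet the chain as a whole trips the policy. A real system would need a far richer notion of accumulated state than a tag union, but the structural point is the same: the check has to see the whole sequence.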