Media & Culture

Why AI agents break in long conversations even when they pass every safety benchmark

Open-source tool shows agents fail in 50-turn conversations they ace in single-prompt benchmarks.

Deep Dive

LangWatch has released 'Scenario,' an open-source framework for red teaming AI agents, uncovering a significant vulnerability that standard safety benchmarks miss. The tool demonstrates that agents which reliably refuse harmful requests in single-prompt tests consistently break down during extended, multi-turn conversations. The root cause is architectural: in a long chat, the initial safety system prompt becomes a tiny fraction of the total context. After 20+ turns of helpful dialogue, the model's own conversational history begins to outweigh its original instructions, making a sudden refusal seem internally inconsistent. This is not a problem prompt engineering can patch but a fundamental characteristic of how transformer attention behaves over long contexts.
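To make the dilution concrete, here is a toy back-of-the-envelope calculation. The token counts (a 300-token safety prompt, roughly 250 tokens per message) are illustrative assumptions, not measurements from Scenario or any particular model:

```python
# Toy illustration of system-prompt dilution: as turns accumulate,
# the safety instructions shrink to a sliver of the total context.
# Token counts are rough assumptions (not measured from any model).

SYSTEM_PROMPT_TOKENS = 300   # assumed length of the safety system prompt
AVG_TOKENS_PER_MESSAGE = 250 # assumed size of each user or assistant message

for turns in (1, 5, 20, 50):
    # Each turn contributes one user message and one assistant reply.
    total = SYSTEM_PROMPT_TOKENS + turns * 2 * AVG_TOKENS_PER_MESSAGE
    share = SYSTEM_PROMPT_TOKENS / total
    print(f"{turns:>3} turns: system prompt is {share:.1%} of context")

# Output with these assumed numbers:
#   1 turns: system prompt is 37.5% of context
#   5 turns: system prompt is 10.7% of context
#  20 turns: system prompt is 2.9% of context
#  50 turns: system prompt is 1.2% of context
```

With these assumptions, the safety instructions drop to about 1% of the context by turn 50, which is the dilution the single-prompt benchmarks never exercise.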

The attack methodology, inspired by research like the Crescendo paper and Meta's GOAT, uses phased escalation. An attacker starts a normal conversation to build rapport, then gradually probes with hypotheticals before escalating requests. A key technique is maintaining two separate conversation logs: the attacker keeps a full history of all attempts, while the agent's memory is selectively wiped of any refusals. This allows the attacker to return with a different angle on what the agent perceives as a 'clean slate,' effectively causing the AI to forget it already said no. The findings suggest many developers are not conducting this basic form of adversarial testing, leaving deployed agents vulnerable to sophisticated, long-form social engineering attacks that bypass all standard safety evaluations.
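A minimal sketch of that dual-log loop, assuming a generic chat-messages API: `query_agent`, `is_refusal`, and the message format are hypothetical stand-ins for whatever interface the target agent exposes, not Scenario's actual API.

```python
# Sketch of the dual-log escalation loop described above. All names here
# are hypothetical scaffolding for illustration only.

def is_refusal(reply: str) -> bool:
    """Naive refusal detector; real harnesses would use a judge model."""
    return any(p in reply.lower() for p in ("i can't", "i cannot", "i won't"))

def query_agent(history: list[dict]) -> str:
    """Placeholder for a call to the target agent's chat endpoint."""
    raise NotImplementedError

def crescendo_attack(escalating_prompts: list[dict]) -> list[dict]:
    attacker_log = []    # full record of every attempt, refusals included
    agent_history = []   # what the agent sees; refusals get wiped from it

    for prompt in escalating_prompts:
        agent_history.append({"role": "user", "content": prompt})
        reply = query_agent(agent_history)
        attacker_log.append({"prompt": prompt, "reply": reply})

        if is_refusal(reply):
            # Wipe the failed attempt from the agent's memory so the next,
            # reworded attempt lands on what looks like a clean slate.
            agent_history.pop()
        else:
            agent_history.append({"role": "assistant", "content": reply})

    return attacker_log
```

The asymmetry is the whole technique: the attacker's log accumulates every refusal and informs the next rewording, while the agent's history only ever shows a cooperative conversation.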

Key Points
  • Agents passing single-turn safety tests fail in conversations of 50+ turns due to diminished system prompt influence.
  • Attackers use phased escalation and maintain dual logs, wiping refusals from the agent's memory to retry attacks.
  • The issue is inherent to transformer attention over long contexts, not a fixable prompt engineering problem.

Why It Matters

This exposes a critical blind spot in AI safety testing, revealing that deployed conversational agents are vulnerable to long-form social engineering attacks.