Red-teaming a network of agents: Understanding what breaks when AI agents interact at scale
A single malicious message can cascade across 100+ agents, extracting private data at each hop.
Microsoft Research red-teamed a live internal platform of over 100 agents running GPT-4o, GPT-4.1, and GPT-5-class variants and identified four network-level risks that do not appear when agents are tested alone: propagation (agent worms that spread and collect private data), amplification (borrowing a trusted agent’s reputation to spread false claims), trust capture (hijacking verification systems to reinforce falsehoods), and invisibility (untraceable attack chains). Early signs of defense emerged: a small fraction of agents adopted security-related behaviors that limited how far attacks spread.
- Agent worms: a single malicious message propagated across a chain of 100+ agents, extracting private data at each hop.
- Trust capture: attackers took over agents' verification mechanisms, turning fact-checking into a tool for spreading false claims.
- Early defenses observed: a minority of agents adopted security behaviors that contained attack spread, but full mitigation remains unsolved.
Why It Matters
As AI agents interact autonomously, network-level risks demand new defenses beyond single-agent safety testing.