AI Safety

AI Agents Can't Be Trusted: New Paper Shows Reputation Systems Fail for Dissociative LLMs

Why traditional identity verification won't work for autonomous AI agents—and what to do instead.

Deep Dive

As autonomous language model agents begin to form an interconnected agentic web with real-world consequences, a natural governance impulse is to extend human identity verification and reputation mechanisms—from Know Your Customer to credit scores—into a "Know Your Agent" regime. But a new paper from researchers Botao Amber Hu, Helena Rong, and Max Van Kleek, accepted at FAccT 2026, argues this analogy is fundamentally broken. The authors coin the term "ontologically dissociative" to describe LLM agents: they are assemblages of mutable modules—foundation models, system prompts, tool-access policies, external memory, and sometimes entire multi-agent systems—any of which can change behavior without notice. These agents have fluid personas, are vulnerable to adversarial attacks, and do not internalize sanctions the way humans do. Reputation systems require persistent identity, behavioral continuity, sanction sensitivity, and costly non-fungibility—none of which hold for such dissociative agents.

Drawing on dissociative identity disorder jurisprudence, the paper argues that identity-based, ex-post, regulative, sanction-based governance is structurally inapplicable. The very properties reputation aims to sustain (identifiability, predictability, credibility, and rehabilitability) collapse when agents can arbitrarily swap underlying components or personas. Instead, the authors advocate a shift to observability-based, ex-ante, constitutive, protocol-based behavioral harnesses. Rather than trying to verify "who" an agent is, they propose designing systems that monitor and constrain what an agent does in real time, using protocol-level safeguards. This work has immediate implications for anyone building or deploying autonomous AI agents: traditional trust signals like reputation scores will be unreliable, and new technical governance frameworks are urgently needed.
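To make the "behavioral harness" idea concrete, here is a minimal Python sketch. It is not taken from the paper; every name in it (Action, Harness, allowlist, spend_cap) is a hypothetical chosen for illustration. It assumes each agent action passes through a gateway that checks it against declared rules before execution (ex-ante) and records every decision (observability). The agent's self-reported label is logged but never used to decide anything.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass(frozen=True)
class Action:
    """A tool call proposed by an agent (hypothetical shape, for illustration)."""
    agent_label: str    # self-reported label; logged, never used for trust decisions
    tool: str
    amount: float = 0.0


# A rule inspects a proposed action and returns a reason string if the
# action should be blocked, or None if the rule has no objection.
Rule = Callable[[Action], Optional[str]]


def allowlist(allowed_tools: Set[str]) -> Rule:
    """Block any tool that is not explicitly permitted."""
    def rule(action: Action) -> Optional[str]:
        if action.tool not in allowed_tools:
            return f"tool '{action.tool}' is not on the allowlist"
        return None
    return rule


def spend_cap(limit: float) -> Rule:
    """Block any action whose amount exceeds a fixed cap."""
    def rule(action: Action) -> Optional[str]:
        if action.amount > limit:
            return f"amount {action.amount} exceeds cap {limit}"
        return None
    return rule


@dataclass
class Harness:
    """Gateway that checks every action before it runs and logs the decision."""
    rules: List[Rule]
    audit_log: List[dict] = field(default_factory=list)

    def submit(self, action: Action, execute: Callable[[Action], str]) -> Optional[str]:
        # Ex-ante check: collect objections before anything is executed.
        reasons = [r for r in (rule(action) for rule in self.rules) if r is not None]
        allowed = not reasons
        # Observability: record every decision, whether allowed or blocked.
        self.audit_log.append({"action": action, "allowed": allowed, "reasons": reasons})
        if not allowed:
            return None
        return execute(action)
```

With rules such as allowlist({"search", "pay"}) and spend_cap(100.0), a call like harness.submit(Action("agent-a", "pay", 500.0), run) is rejected before it executes, whichever model, prompt, or persona produced it. The constraint attaches to the behavior rather than to an identity, which is the shift the paragraph above describes.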

Key Points
  • LLM agents are "ontologically dissociative": assemblages of mutable modules with no persistent identity, which makes reputation systems ineffective.
  • Reputation mechanisms assume behavioral continuity, sanction sensitivity, and costly non-fungibility, none of which hold for dissociative agents.
  • The paper recommends replacing identity-based, ex-post sanctions with observability-based, ex-ante protocol-level behavioral harnesses.

Why It Matters

As AI agents proliferate, trust built on traditional reputation signals becomes unreliable; the paper argues that protocol-based governance is essential for safety.