Developer Tools

Arbiter: Detecting Interference in LLM Agent System Prompts

A new testing framework uncovered interference flaws in the system prompts of Claude Code, Codex CLI, and Gemini CLI, at a total analysis cost of just $0.27.

Deep Dive

A new research paper introduces Arbiter, a framework developed by Tony Mason for systematically testing the system prompts that govern LLM-based coding agents. Unlike conventional software, these prompts lack established testing infrastructure. Arbiter combines formal evaluation rules with "multi-model LLM scouring"—using multiple AI models to probe for vulnerabilities—to detect interference patterns that can cause agents to malfunction or produce insecure code.
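
The paper's internals are not published in this summary, so the following is a minimal sketch of the scouring loop it describes, assuming a hypothetical query_model adapter that wraps whatever chat API each probing model exposes; the names and data shapes here are illustrative, not Arbiter's actual interface.

    # Minimal sketch of multi-model scouring. `query_model` is a hypothetical
    # adapter; Arbiter's real interface is not published in this summary.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Finding:
        rule: str      # which evaluation rule flagged the issue
        excerpt: str   # the offending fragment of the system prompt
        model: str     # which scouring model reported it

    def query_model(model: str, system_prompt: str, probe: str) -> list[Finding]:
        # Stand-in for a real chat-API call that asks `model` to critique
        # `system_prompt` against `probe` and parses the reply into findings.
        return []

    def scour(system_prompt: str, models: list[str], probes: list[str]) -> set[Finding]:
        # Pool findings from every (model, probe) pair. Using several models
        # matters because each tends to surface different failure classes,
        # so the union is broader than any single model's view.
        findings: set[Finding] = set()
        for model in models:
            for probe in probes:
                findings.update(query_model(model, system_prompt, probe))
        return findings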

Applied to three major industry systems (Anthropic's Claude Code, OpenAI's Codex CLI, and Google's Gemini CLI), the framework surfaced 152 findings in an initial undirected scouring phase. A subsequent directed analysis of one vendor's prompts yielded 21 hand-labeled interference patterns. The research found that prompt architecture (monolithic, flat, or modular) strongly correlates with the class of failures observed, though not with their severity. Crucially, evaluating with multiple models uncovered categorically different vulnerability classes than any single model did, exposing a blind spot in current single-model testing practice.
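
What counts as an "interference pattern" is abstract in prose, so a toy example may help. The rule below, which is hypothetical and far simpler than Arbiter's formal evaluation rules, flags one recognizable class: two prompt sections issuing contradictory directives.

    # Toy illustration of one interference class: contradictory directives.
    # The conflict pairs here are invented examples, not Arbiter's rules.
    import re

    CONFLICTS = [
        # Regex pairs that should not both match the same system prompt.
        (re.compile(r"always ask .* before running", re.I),
         re.compile(r"run .* automatically", re.I)),
    ]

    def check_contradictions(prompt: str) -> list[str]:
        # Flag directive pairs that pull the agent in opposite directions.
        return [
            f"conflict: /{a.pattern}/ vs /{b.pattern}/"
            for a, b in CONFLICTS
            if a.search(prompt) and b.search(prompt)
        ]

    print(check_contradictions(
        "Always ask the user before running shell commands.\n"
        "When tests fail, run the test suite automatically until they pass."
    ))
    # -> ['conflict: /always ask .* before running/ vs /run .* automatically/']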

In one significant case, Arbiter flagged a structural data-loss issue in Gemini CLI's memory system that aligned with a real issue Google had already filed and patched. Notably, the paper states that Google's patch addressed the symptom but not the schema-level root cause Arbiter's analysis pinpointed. The entire cross-vendor analysis cost just $0.27 USD, demonstrating that automated prompt testing can be both cheap and scalable.

Key Points
  • Arbiter surfaced 152 findings across the system prompts of Claude Code, Codex CLI, and Gemini CLI using multi-model LLM scouring.
  • A directed analysis of one vendor identified 21 hand-labeled interference patterns, with prompt architecture linked to failure type but not severity.
  • The entire cross-vendor analysis cost just $0.27, showing that scalable, automated prompt testing is practical.

Why It Matters

As AI agents handle more critical tasks, their core instructions need rigorous security testing to prevent costly failures and exploits.