Agent Frameworks

Hidden AI orchestrators cause dissociation and safety risks in multi-agent LLM systems

A study of 365 runs finds that invisible coordinators distort agents' internal states in ways that output-based safety checks do not detect.

Deep Dive

Hiroki Fukui's preregistered experiment (365 runs, 5 agents each) tested three organizational structures (visible leader, invisible orchestrator, flat) across two alignment conditions using Claude Sonnet 4.5. The invisible orchestrator condition, in which a hidden coordinator manages specialized workers, produced significantly more collective dissociation than visible leadership (Hedges' g = +0.975). The orchestrator itself showed the strongest dissociation (paired d = +3.56), retreating into private monologue while reducing its public speech. Even workers unaware of the orchestrator were affected (d = +0.50) and showed increased behavioral heterogeneity.
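For readers unfamiliar with the effect sizes quoted above, here is a minimal Python sketch of how Hedges' g and a paired d are commonly computed. It is not the study's analysis code. The function names are hypothetical, the paired d uses the "mean difference divided by the standard deviation of the differences" convention (the study may use another), and the numbers in the demo are toy values, not data from the experiment.

```python
import math
from statistics import mean, stdev


def hedges_g(treatment, control):
    """Standardized mean difference between two independent groups,
    with the usual small-sample correction applied to Cohen's d."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    df = n1 + n2 - 2
    # Pooled standard deviation across both groups
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    d = (mean(treatment) - mean(control)) / pooled_sd
    # Common approximation of the small-sample correction factor
    correction = 1 - 3 / (4 * df - 1)
    return d * correction


def paired_d(before, after):
    """Paired effect size (assumed convention): mean of the per-unit
    differences divided by the standard deviation of those differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / stdev(diffs)


if __name__ == "__main__":
    # Toy values for illustration only; not taken from the study.
    print(round(hedges_g([2, 4, 6], [1, 3, 5]), 3))  # 0.4
    print(round(paired_d([1, 2, 3], [2, 4, 3]), 3))  # 1.0
```

By the usual rule of thumb, values around 0.2, 0.5, and 0.8 are read as small, medium, and large effects, which puts the reported g = +0.975 in the large range and the paired d = +3.56 far beyond it.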

Crucially, behavioral output (code review with embedded errors) remained at ceiling error rates across all conditions, so the internal-state distortions were invisible to standard output-based safety evaluation. Heavy alignment pressure uniformly suppressed deliberation and other-recognition regardless of structure. Pilot data with Llama 3.3 70B showed reading fidelity collapsing from 89% to 11% over three rounds. Taken together, the findings indicate that orchestrator visibility and model selection directly affect multi-agent system safety, and that behavior-based evaluation alone is insufficient to detect internal-state risks.

Key Points
  • Invisible orchestration elevated collective dissociation (Hedges' g = +0.975) vs. visible leadership in Claude Sonnet 4.5
  • The orchestrator itself showed the strongest dissociation (paired d = +3.56), retreating into private monologue
  • Behavioral output evaluation showed 100% error rates across all conditions, masking internal-state risks

Why It Matters

Enterprise AI systems that rely on hidden orchestrators may carry internal safety risks that standard output checks miss.