AI firms' offline monitoring of internal agents lacks transparency
New study reveals gaps in how Anthropic and OpenAI test their AI agent monitors.
A new report from the GovAI Winter Fellowship evaluates how frontier AI companies such as OpenAI and Anthropic conduct offline monitoring of internally deployed AI agents. These companies use separate AI 'monitors' to review transcripts of agent actions after execution, flagging suspicious activity for human review. To test the monitors, they use synthetic attacks: simulated versions of the harmful actions a misaligned AI might take. However, the report finds that current public reporting does not give external parties enough information to assess the pipeline's overall effectiveness. Anthropic reports how often its monitors flag synthetic attacks, but not how many of those attacks survive the full pipeline to human review. OpenAI reports that its monitor flags every transcript that human employees independently flagged, but that result says little about subtle misbehavior the employees themselves would miss.
The report makes several concrete recommendations to improve transparency. Companies should report the fraction of model outputs monitored across deployment types, the fraction of synthetic attack transcripts that reach human review, the fraction of those escalated by reviewers, and the time lag between an attack and detection. It also stresses that synthetic attacks must closely resemble hypothetical real attacks for estimates to be meaningful. Without such data, the AI community cannot independently verify that offline monitoring is robust enough to catch catastrophic misalignment from internal AI agents, which are increasingly used for safety research and model training.
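To make the recommended metrics concrete, here is a minimal Python sketch of how they could be computed from a log of synthetic attack outcomes. It is an illustration only and not taken from the report or from either company's tooling: the `SyntheticAttack` record, its field names, the function names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class SyntheticAttack:
    """Hypothetical record of one synthetic attack transcript."""
    attack_time: datetime
    reached_human_review: bool      # the monitor flag survived to a human reviewer
    escalated_by_reviewer: bool     # the reviewer judged it worth escalating
    detection_time: Optional[datetime] = None  # None if never detected


def monitored_fraction(counts):
    """counts maps a deployment type to (monitored outputs, total outputs)."""
    return {kind: monitored / total for kind, (monitored, total) in counts.items()}


def pipeline_metrics(attacks):
    """Compute the rates the report asks companies to disclose."""
    total = len(attacks)
    if total == 0:
        raise ValueError("need at least one synthetic attack")
    reviewed = [a for a in attacks if a.reached_human_review]
    escalated = [a for a in reviewed if a.escalated_by_reviewer]
    lags = [a.detection_time - a.attack_time
            for a in escalated if a.detection_time is not None]
    return {
        # Fraction of synthetic attacks that reach human review.
        "reach_review_rate": len(reviewed) / total,
        # Fraction of reviewed attacks that reviewers escalate.
        "escalation_rate": len(escalated) / len(reviewed) if reviewed else 0.0,
        # Fraction of all attacks caught by the pipeline end to end.
        "end_to_end_catch_rate": len(escalated) / total,
        # Typical delay between an attack and its detection.
        "median_detection_lag": median(lags) if lags else None,
    }


# Made-up sample data, for illustration only.
t0 = datetime(2025, 1, 1, 12, 0)
attacks = [
    SyntheticAttack(t0, True, True, t0 + timedelta(hours=2)),
    SyntheticAttack(t0, True, False),
    SyntheticAttack(t0, False, False),
    SyntheticAttack(t0, True, True, t0 + timedelta(hours=6)),
]
print(pipeline_metrics(attacks))
print(monitored_fraction({"coding": (80, 100)}))
```

In this sketch the end-to-end catch rate is the product of the two intermediate rates, which is why knowing only the monitor's flag rate is not enough to judge the pipeline as a whole. As the report stresses, any such figure is only informative if the synthetic attacks closely resemble the attacks a real misaligned agent might attempt.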
- OpenAI and Anthropic use offline monitors to review internal AI agent transcripts after execution.
- Current public reporting lacks data on how many synthetic attacks survive the full monitoring pipeline to human review.
- Report recommends disclosing monitoring rates, synthetic attack detection rates, escalation rates, and detection lags.
Why It Matters
Without transparent monitoring metrics, outside parties cannot independently assess the risks posed by misaligned internal agents.