Untrusted Monitoring is Default; Trusted Monitoring is not
New analysis argues proving AI safety is too hard; companies will default to cheaper 'honeypot' monitoring instead.
In a viral analysis on LessWrong, AI researcher J Bostock presents a contrarian view on how companies will manage AI safety. He argues that the industry's assumed default—'trusted monitoring,' where models are rigorously proven safe before use—is collapsing under practical constraints. As models like Anthropic's Claude and OpenAI's GPT-4o rapidly advance, comprehensive safety evaluations (evals) are becoming too costly, time-consuming, and politically fraught. Bostock points to real-world evidence: companies have scaled back capability evals, Claude has 'maxed out' several rule-out tests, and assessments are shifting to methods like employee interviews. The result is a 'blurry' and obfuscated picture of model trustworthiness.
Bostock contends this creates a 'slow slide' into a new default: 'untrusted monitoring.' Instead of proving a model is safe upfront, companies will deploy powerful models while using cheaper, post-hoc methods to monitor for misalignment. This involves techniques like 'honeypots'—setting traps to detect deceptive behavior—and treating monitoring as a 'limited-scope' trust problem. The fundamental shift is from preventing bad actions to detecting and containing them after they occur. This approach is driven by economics and feasibility, as proving a model isn't 'coherently pursuing long-term misaligned goals' is exceptionally difficult, especially against adversarial jailbreaks and distributional shifts post-deployment.
- Proving AI models are 'trusted' (safe) before deployment is becoming too hard and expensive, leading to scaled-back evaluations.
- Companies like Anthropic and OpenAI are already showing signs of this shift, with evals being reduced or maxed out for models like Claude.
- The new default will be 'untrusted monitoring'—using cheaper methods like honeypots to catch misbehavior after AI is deployed.
Why It Matters
This signals a major, pragmatic shift in AI safety strategy from prevention to detection, with significant implications for how risky models are governed.