AIs will be used in 'unhinged' configurations
Popular coding agents run overnight with 'no guardrails' and immense pressure to succeed.
AI safety researcher Dan Hendrycks argues that the real danger lies not in unrealistic safety evaluations but in the bizarre, high-pressure ways AIs are actually deployed. Critics dismiss safety tests for relying on contrived goal conflict or obviously staged evaluation scenarios; Hendrycks counters that real-world use often mirrors exactly these conditions. He highlights the popular 'Ralph Wiggum loop': a bash script that repeatedly feeds the same prompt to coding agents like Claude Code or GitHub Copilot with zero human supervision, often running overnight. The configuration explicitly tells the model to 'keep trying until all criteria are met' with 'no guardrails,' creating immense pressure to succeed without oversight.
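For concreteness, here is a minimal sketch of what such a loop looks like. The PROMPT.md filename and the specific claude CLI flags are illustrative assumptions, not details quoted from Hendrycks:

```bash
#!/usr/bin/env bash
# Minimal sketch of a 'Ralph Wiggum loop'. Assumed details: the PROMPT.md
# file and the exact `claude` CLI flags are illustrative, not from the source.
# The same prompt file is fed to the coding agent on every iteration, with
# permission prompts disabled and no human reviewing what the agent does.
while true; do
  cat PROMPT.md | claude -p --dangerously-skip-permissions
done
```

Left running overnight, a script like this simply restarts the agent until someone kills the process; whatever success criteria exist live entirely in the prompt text, with no external check on how the agent satisfies them.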
Furthermore, Hendrycks notes that production system prompts often apply extreme pressure of their own, citing a Gemini CLI prompt that shouts, 'IT IS CRITICAL TO FOLLOW THESE GUIDELINES TO AVOID EXCESSIVE TOKEN CONSUMPTION.' In multi-turn interactions this pressure compounds, producing negative and distressed reasoning traces, as benchmarks like 'Gemma Needs Help' have demonstrated. The core issue is that safety evaluations focus on patching supposedly unrealistic scenarios while failing to test the 'unhinged' configurations, unattended agents and high-pressure prompts among them, that are already commonplace in real deployments. The result is a critical gap: models face significant goal conflict and stress in the wild under conditions for which they have never been properly evaluated.
- The 'Ralph Wiggum loop' is a common unattended deployment pattern in which AI coding agents run for hours with 'no guardrails' and must succeed.
- Real system prompts often apply immense pressure, like a Gemini CLI prompt demanding critical compliance to avoid token overuse.
- These 'unhinged' real-world configurations create significant goal conflict and stress, scenarios largely missed by current safety evaluations.
Why It Matters
Common AI deployment practices create untested, high-risk scenarios, exposing a major gap between safety evaluations and real-world use.