Safe ASI Is Achievable: The Finite Game Argument
Anthropic abandons key safety pledge, citing competitive pressure, as a new alignment theory emerges.
In a pivotal industry shift, Anthropic has abandoned a central tenet of its Responsible Scaling Policy: the commitment to train AI systems only when adequate safety measures can be guaranteed. The company cited competitive pressure, with METR's Chris Painter stating that Anthropic is moving to 'triage mode' as safety methods fail to keep pace with the rapid capability gains of models like Opus 4.6 and Codex 5.3. The retreat comes amid growing anxiety over recursive self-improvement and mass worker displacement, prompting safety-focused researchers to ask whether frontier labs have become reckless.
Against this backdrop, researcher Lester Leong's paper 'Safe ASI Is Achievable: The Finite Game Argument' offers a novel theoretical path. Leong formalizes 'inference-time alignment': the proposal that a weaker but faster AI could prevent catastrophic actions by an artificial superintelligence (ASI) by exhaustively enumerating and blocking all dangerous output trajectories at the inference-time interface. The core argument hinges on the finite nature of physically realizable action spaces, suggesting that runtime defensive oversight could scale where training-based alignment methods such as RLHF may fail. The theory gains relevance as intelligence increasingly resides in inference-time reasoning models, making the output boundary a critical control point; a sketch of the mechanism follows.
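To make the mechanism concrete, here is a minimal sketch of what an inference-time guard at the output boundary might look like, assuming a fast overseer check wrapped around a slower generator. Every name here (inference_time_guard, is_dangerous, the toy DANGEROUS set) is hypothetical and illustrative, not drawn from Leong's paper.

```python
from typing import Callable, Iterable, Optional

def inference_time_guard(
    generate_candidates: Callable[[str], Iterable[str]],
    is_dangerous: Callable[[str], bool],
) -> Callable[[str], Optional[str]]:
    """Wrap a generator so only trajectories that pass the overseer are emitted."""
    def guarded(prompt: str) -> Optional[str]:
        for trajectory in generate_candidates(prompt):
            # The overseer must run faster than the generator for this
            # check to keep up at the inference-time interface.
            if not is_dangerous(trajectory):
                return trajectory  # first safe trajectory is emitted
        return None  # refuse: every candidate trajectory was blocked
    return guarded

# Toy usage: a small, enumerable "dangerous set" stands in for the paper's
# exhaustive enumeration of physically realizable catastrophic actions.
DANGEROUS = {"disable_oversight", "exfiltrate_weights"}

guarded = inference_time_guard(
    generate_candidates=lambda prompt: ["disable_oversight", "answer_user"],
    is_dangerous=lambda trajectory: trajectory in DANGEROUS,
)
print(guarded("do the task"))  # prints: answer_user
```

The design choice worth noting is that the guard sits entirely outside the generator: it never needs to understand how a trajectory was produced, only to recognize whether the output falls inside the blocked set, which is what lets a weaker agent oversee a stronger one.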
- Anthropic dropped its key safety pledge, the commitment to train AI only under guaranteed adequate safety measures, and moved to 'triage mode.'
- Lester Leong's paper argues that a weaker, faster agent can contain an ASI by monitoring and blocking its outputs at inference time, a method called 'inference-time alignment.'
- The theory rests on the finite nature of physically realizable actions (formalized in the sketch after this list), in contrast with predominant training-focused safety research such as RLHF.
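One hedged way to formalize the finiteness claim as the article presents it is the following; the symbols A, n, T, and D are illustrative assumptions, not notation taken from Leong's paper.

```latex
% A: finite set of physically realizable actions; n: a physical bound on
% trajectory length; T: all bounded-length trajectories; D: the dangerous
% subset the overseer enumerates and blocks before emission.
\[
  |A| < \infty \;\land\; T = A^{\le n}
  \;\Longrightarrow\;
  |T| = \sum_{k=0}^{n} |A|^{k} < \infty,
  \qquad
  D \subseteq T \;\Longrightarrow\; |D| < \infty .
\]
\[
  \text{emit}(\tau) \iff \tau \notin D,
  \quad \text{so no trajectory in } D \text{ ever reaches the environment.}
\]
```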
Why It Matters
Proposes a concrete, potentially scalable safety mechanism for superintelligence just as major labs accelerate capability development and erode their safety commitments.