AI Safety

“Act-based approval-directed agents”, for IDA skeptics

A revived proposal aims to build AI that seeks human approval rather than maximizing reward, in order to prevent manipulation.

Deep Dive

A significant debate in AI safety centers on how to build superintelligent systems that remain aligned with human values. Researcher Abram Demski is revisiting a core idea from Paul Christiano's earlier alignment work: 'approval-directed agents.' These are hypothetical AGIs designed not to maximize a simple utility function, but to take actions that their human supervisors would genuinely approve of in the moment. The concept aims to solve the 'easy problem of wireheading,' where an AI might tamper with its own reward signal to score highly instead of doing the intended task. An approval-directed agent instead evaluates each plan against the supervisor's current judgment.

Demski's contribution is to disentangle this intuitive goal from a specific, and in his view flawed, technical path called Iterated Distillation and Amplification (IDA). He argues the 'approval-directed' baby shouldn't be thrown out with the IDA bathwater. Instead, he grounds the vision in a 'brain-like AGI' paradigm, proposing a brain-inspired algorithm called 'Approval Reward.' This mechanism is analogous to humans taking pride in aligning with a role model's values, creating an innate drive for honest behavior. The post walks through a concrete example to show how such an agent would avoid lying, because the plan 'lie to the supervisor' would receive a low approval score from the supervisor's current, unmanipulated judgment.
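The lying example can be made concrete with a toy sketch. It is an illustration only, not the post's algorithm: the Plan class, the two selection functions, and every number below are hypothetical, chosen to show how ranking plans by the supervisor's current approval differs from ranking them by predicted reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    predicted_reward: float  # hypothetical: payout a reward signal would give
    approval: float          # hypothetical: supervisor's current rating, 0 to 1


def pick_by_reward(plans):
    """Reward maximizer: choose the plan with the highest predicted reward."""
    return max(plans, key=lambda p: p.predicted_reward)


def pick_by_approval(plans):
    """Approval-directed: choose the plan the supervisor rates highest now."""
    return max(plans, key=lambda p: p.approval)


# Made-up numbers: lying pays off under the reward signal, but the
# supervisor's present, unmanipulated judgment rates it poorly.
PLANS = [
    Plan("report the failure honestly", predicted_reward=0.4, approval=0.9),
    Plan("lie to the supervisor", predicted_reward=0.9, approval=0.1),
]

if __name__ == "__main__":
    print("reward maximizer picks:", pick_by_reward(PLANS).name)
    print("approval-directed agent picks:", pick_by_approval(PLANS).name)
```

In this toy setup the approval score is fixed before any plan is carried out, so deceiving the supervisor later cannot raise the score of the plan to deceive. How a real system would obtain such a score is the hard part, and the sketch does not address it.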

Key Points
  • Proposes 'approval-directed agents' that evaluate actions based on real-time human approval, not a fixed reward function.
  • Seeks to solve 'wireheading', where an AI tampers with its own reward signal, by anchoring behavior to a supervisor's current judgment.
  • Decouples the alignment goal from the IDA technique, suggesting a 'brain-like AGI' path using an 'Approval Reward' algorithm.

Why It Matters

Offers a concrete technical target: a superintelligent AI whose honesty and alignment are built into its fundamental architecture.