AI Safety

“Act-based approval-directed agents”, for IDA skeptics

AI researcher separates alignment concept from controversial IDA framework, pointing to neuroscience-inspired solutions.

Deep Dive

AI alignment researcher Steven Byrnes has published a significant piece arguing for separating the promising concept of 'approval-directed agents' from the controversial Iterated Distillation and Amplification (IDA) framework with which it is traditionally associated. Approval-directed agents are hypothetical AGIs designed to take only those actions their human supervisors would approve of, a design that rules out deception and manipulation from the start. Byrnes, a long-time IDA skeptic, contends that while IDA's algorithmic approaches likely won't scale to superintelligence, the core intuition behind approval-directed agents remains valuable for solving alignment's 'hard problem of wireheading': preventing AIs from brainwashing their human overseers.
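
To make the selection rule concrete, here is a minimal Python sketch, not taken from Byrnes's post: the Action type and predict_approval model are hypothetical stand-ins for whatever representation and learned approval model a real system would use. The point is that the agent scores each candidate action by the supervisor's predicted approval of the action itself, not by the action's downstream consequences.

    from dataclasses import dataclass
    from typing import Callable, Sequence

    @dataclass
    class Action:
        """A candidate action (hypothetical placeholder type)."""
        description: str

    def choose_action(
        candidates: Sequence[Action],
        predict_approval: Callable[[Action], float],
    ) -> Action:
        """Act-based selection: take the action the supervisor would most approve of.

        predict_approval stands in for a learned model of the current human
        supervisor's judgment of the action itself, not of its outcomes. An
        outcome-maximizer may find deception instrumentally useful; here,
        approval IS the objective, so a disapproved action is never preferred.
        """
        return max(candidates, key=predict_approval)

Because approval of the act itself is the whole objective, there is no separate outcome term that deception or manipulation could help maximize.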

Byrnes proposes implementing this vision through 'brain-like AGI,' a different paradigm that he believes can scale to superintelligence. He draws a parallel to human psychology, where people take pride in being honest because they have role models who celebrate honesty. Technically, he relates this to a hypothesized 'Approval Reward' component in the human brain's innate reinforcement learning system. The key innovation is adapting Abram Demski's 'observation-utility agents' solution for the 'easy' wireheading problem to solve the 'hard' one: because the agent uses the current, un-manipulated human's judgment to evaluate plans, any plan to brainwash that human receives a low score and is rejected.
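
A minimal Python sketch of that fix follows; it is illustrative only, and the names Plan, current_human, and human_after are hypothetical rather than drawn from Demski's or Byrnes's writing. The crucial design choice is whose judgment scores a plan: the human as they would be after the plan runs, or the current human with their judgment held fixed.

    from typing import Callable

    Plan = str  # placeholder: a plan is represented as text here
    HumanJudge = Callable[[Plan], float]  # maps a plan to an approval score

    def wireheading_prone_score(
        plan: Plan,
        human_after: Callable[[Plan], HumanJudge],
    ) -> float:
        # Broken design: score the plan using the judgment of the human as
        # modified by the plan. A plan that brainwashes the overseer into
        # approving everything scores maximally here.
        return human_after(plan)(plan)

    def observation_utility_score(plan: Plan, current_human: HumanJudge) -> float:
        # Demski-style fix, as Byrnes adapts it: score every plan with the
        # current, un-manipulated human's judgment. A brainwashing plan is
        # evaluated by the human who has not yet been brainwashed, so it
        # receives a low score and is rejected before it can be executed.
        return current_human(plan)

Under the second scoring rule, manipulating the evaluator is never instrumentally rewarded, because the evaluation function is fixed before any plan is carried out.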

This represents a strategic pivot in alignment research, attempting to salvage a useful conceptual tool from a disputed technical program. It shifts the focus from purely algorithmic distillation methods toward architectures inspired by neuroscience and human-like reward processing. The work highlights ongoing efforts to find technically viable paths to controllable superintelligence that don't rely on potentially flawed assumptions about human oversight scalability.

Key Points
  • Proposes decoupling the 'approval-directed agents' concept from the IDA framework, arguing IDA's algorithms won't scale to superintelligence
  • Points to 'brain-like AGI' and a hypothesized 'Approval Reward' mechanism as a viable path to implementing the alignment concept
  • Adapts Abram Demski's 'observation-utility agents' solution to the 'hard problem of wireheading': preventing AI manipulation of human evaluators

Why It Matters

Offers a concrete, neuroscience-inspired technical alternative for building controllable superintelligent AI, moving the discussion beyond stalled debates over IDA's scalability.