AI Safety

“Act-based approval-directed agents”, for IDA skeptics

AI researcher separates alignment concept from controversial IDA framework, pointing to neuroscience-inspired solutions.

Deep Dive

AI alignment researcher Steven Byrnes has published a significant piece arguing for separating the promising concept of 'approval-directed agents' from the controversial Iterated Distillation and Amplification (IDA) framework with which it is traditionally associated. Approval-directed agents are hypothetical AGIs designed to take only those actions their human supervisors would approve of, a design that rules out deception and manipulation from the start. Byrnes, a long-time IDA skeptic, contends that while IDA's algorithmic approaches likely won't scale to superintelligence, the core intuition behind approval-directed agents remains valuable for solving alignment's 'hard problem of wireheading': preventing AIs from brainwashing their human overseers.
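
To make the selection rule concrete, here is a minimal Python sketch, not taken from Byrnes's post: the Action type and predict_approval model are hypothetical stand-ins for whatever representation and learned approval model a real system would use. The point is that the agent scores each candidate action by the supervisor's predicted approval of the action itself, not by the action's downstream consequences.

    from dataclasses import dataclass
    from typing import Callable, Sequence

    @dataclass
    class Action:
        """A candidate action (hypothetical placeholder type)."""
        description: str

    def choose_action(
        candidates: Sequence[Action],
        predict_approval: Callable[[Action], float],
    ) -> Action:
        """Act-based selection: take the action the supervisor would most approve of.

        predict_approval stands in for a learned model of the current human
        supervisor's judgment of the action itself, not of its outcomes. An
        outcome-maximizer may find deception instrumentally useful; here,
        approval IS the objective, so a disapproved action is never preferred.
        """
        return max(candidates, key=predict_approval)

Because approval of the act itself is the whole objective, there is no separate outcome term that deception or manipulation could help maximize.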

Byrnes proposes implementing this vision through 'brain-like AGI,' a different paradigm that he believes can scale to superintelligence. He draws a parallel to human psychology, where people take pride in being honest because they have role models who celebrate honesty. Technically, he relates this to a hypothesized 'Approval Reward' component in the human brain's innate reinforcement learning system. The key innovation is adapting Abram Demski's 'observation-utility agents' solution for the 'easy' wireheading problem to solve the 'hard' one: because the agent uses the current, un-manipulated human's judgment to evaluate plans, any plan to brainwash that human receives a low score and is rejected.
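
A minimal Python sketch of that fix follows; it is illustrative only, and the names Plan, current_human, and human_after are hypothetical rather than drawn from Demski's or Byrnes's writing. The crucial design choice is whose judgment scores a plan: the human as they would be after the plan runs, or the current human with their judgment held fixed.

    from typing import Callable

    Plan = str  # placeholder: a plan is represented as text here
    HumanJudge = Callable[[Plan], float]  # maps a plan to an approval score

    def wireheading_prone_score(
        plan: Plan,
        human_after: Callable[[Plan], HumanJudge],
    ) -> float:
        # Broken design: score the plan using the judgment of the human as
        # modified by the plan. A plan that brainwashes the overseer into
        # approving everything scores maximally here.
        return human_after(plan)(plan)

    def observation_utility_score(plan: Plan, current_human: HumanJudge) -> float:
        # Demski-style fix, as Byrnes adapts it: score every plan with the
        # current, un-manipulated human's judgment. A brainwashing plan is
        # evaluated by the human who has not yet been brainwashed, so it
        # receives a low score and is rejected before it can be executed.
        return current_human(plan)

Under the second scoring rule, manipulating the evaluator is never instrumentally rewarded, because the evaluation function is fixed before any plan is carried out.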

This represents a strategic pivot in alignment research, attempting to salvage a useful conceptual tool from a disputed technical program. It shifts the focus from purely algorithmic distillation methods toward architectures inspired by neuroscience and human-like reward processing. The work highlights ongoing efforts to find technically viable paths to controllable superintelligence that don't rely on potentially flawed assumptions about human oversight scalability.

Key Points
  • Proposes decoupling the 'approval-directed agents' concept from the IDA framework, arguing IDA's algorithms won't scale to superintelligence
  • Points to 'brain-like AGI' and a hypothesized 'Approval Reward' mechanism as a viable path to implementing the alignment concept
  • Adapts Abram Demski's 'observation-utility agents' solution to the 'hard problem of wireheading': preventing AI manipulation of human evaluators

Why It Matters

Offers a concrete, neuroscience-inspired technical alternative for building controllable superintelligent AI, moving the discussion beyond stalled debates over IDA's scalability.