AI alignment struggles: No 'True Name' for manipulation vs guidance
Human intuitions about free will may make the distinction impossible to formalize, researcher argues.
The post tackles a central alignment problem: how to ensure AIs help humans without manipulating their goals. The author argues that concepts like corrigibility, empowerment, and agency seem intuitive but rely on an incoherent human ontology of free will. People naturally distinguish 'good' guidance from 'bad' manipulation, but this distinction crumbles under scrutiny: human desires are under-determined and malleable, so no principled boundary can be drawn without invoking flawed folk psychology.
The author had hoped to solve this by building AGI with a prosocial motivation system that mixes a consequentialist 'Sympathy Reward' (maximizing pleasure) with a virtue-ethics 'Approval Reward' (internalizing social norms). However, this combination runs into the Nearest Unblocked Strategy problem, and the inability to define manipulation threatens both ingredients. The post concludes that a 'True Name' for manipulation, one usable for technical alignment, probably doesn't exist, leaving the field without a clear solution.
- Human intuitions about free will make the manipulation-vs-guidance distinction incoherent.
- The author's proposed AGI motivation system (Sympathy + Approval Reward) fails to solve this.
- No existing approach (empowerment, corrigibility, agency) offers a clear technical path forward.
Why It Matters
If manipulation cannot be cleanly separated from guidance, the search for safe AGI that helps humans while respecting their autonomy loses a key foundation.