AI Safety

Decision theory doesn’t prove that useful strong AIs will doom us all

Utility maximizers don't have to eat the world—history-based preferences change everything.

Deep Dive

A new post on LessWrong by 'deep' (linkpost from expectedsurprise.substack.com) challenges the common AI safety argument that useful strong AIs must be unbounded utility maximizers that 'eat the world.' The author argues that training for optimal behavior does not inevitably produce WorldSUM agents: act-utilitarian optimizers with no side constraints, unbounded scope, and an insatiable appetite for resources. Instead, people and AI labs will prefer to deploy agents with virtue-ethics or deontological approaches, for two reasons: traditional misalignment concerns, and distrust that even a well-intentioned AI will get its consequentialist calculations right (the same reason we constrain human subordinates). Crucially, an agent can maximize a utility function without being a world-eater if its preferences range over its own actions and entire trajectories, not just material world states.
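As a toy illustration of that distinction (not the post's own formalism), here is a minimal Python sketch, assuming made-up State/Action types, a profit-only task, and an illustrative set of illegal actions:

```python
from typing import Dict, List, Tuple

State = Dict[str, float]   # toy world state, e.g. {"profit": 1200.0}
Action = str               # toy action label, e.g. "buy_AAPL"

def worldsum_utility(final_state: State) -> float:
    # WorldSUM-style: scores only the material end state, so more
    # resources always mean more utility and the optimum pushes the
    # agent toward grabbing everything reachable.
    return final_state["profit"]

ILLEGAL = {"insider_trade", "front_run"}   # hypothetical rule set

def trajectory_utility(trajectory: List[Tuple[Action, State]]) -> float:
    # History-based: scores the (action, state) pairs along the way,
    # not just the end state. An illegal action carries a penalty no
    # later profit can buy back, so a maximizer of this function acts
    # as if bound by a deontological side constraint.
    final_profit = trajectory[-1][1]["profit"]
    penalties = sum(1e9 for action, _ in trajectory if action in ILLEGAL)
    return final_profit - penalties
```

Both are utility functions in the decision-theoretic sense; only the second one's domain includes the agent's own history.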

For example, a finance agent can be designed to care about the legality of its trades, not just profit: it still optimizes a utility function, but one with built-in constraints. The author gives a formal argument that RL-trained agents can have utility functions over (action, state) pairs that mimic deontological rules. Such bounded agents can be competitive at human-level tasks (e.g., an agent that cares only about code quality) without needing to grab every available resource. However, as capabilities approach ASI, these protective factors weaken: it becomes harder to distill 'nice' agents from WorldSUM agents, and even responsible users will increasingly want broadly scoped AIs. Still, building non-WorldSUM agents could keep the early stages of an intelligence explosion safe and supply reliable advisors during a pause at human-level capability.
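A minimal sketch of that construction, with a hypothetical per-step reward and rule set standing in for the post's exact setup: any standard RL algorithm maximizing the discounted sum below learns a rule-following policy, provided the penalty outweighs any achievable gain.

```python
from typing import Callable, Dict, List, Tuple

FORBIDDEN = {"insider_trade"}   # hypothetical deontological rule

def deontic_reward(state: Dict[str, float], action: str) -> float:
    # Reward defined over the (action, state) pair, not the state
    # alone: the task term pays for this step's profit, while the rule
    # term makes a forbidden action strictly dominated regardless of
    # how much profit it would earn.
    task_reward = state.get("step_profit", 0.0)
    rule_penalty = -1e6 if action in FORBIDDEN else 0.0
    return task_reward + rule_penalty

def discounted_return(trajectory: List[Tuple[Dict[str, float], str]],
                      reward_fn: Callable[[Dict[str, float], str], float],
                      gamma: float = 0.99) -> float:
    # The objective RL maximizes. Because reward_fn sees the action as
    # well as the state, this return is a utility function over whole
    # trajectories rather than over final world states.
    return sum(gamma ** t * reward_fn(s, a)
               for t, (s, a) in enumerate(trajectory))
```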

Key Points
  • Agents with action-based utility functions (e.g., caring about legality, not just profit) avoid unbounded resource hunger.
  • Bounded utility functions work at par-human level: a code-writing AI need only care about code quality, not world domination (a one-function sketch follows this list).
  • Non-WorldSUM agents are more attractive to labs because of misalignment fears and distrust of flawless consequentialist calculation, even in AIs with the 'right values'.
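For the second point above, a one-function sketch of what 'bounded' can mean in practice, with a hypothetical test-pass ratio standing in for code quality:

```python
def bounded_code_utility(tests_passed: int, total_tests: int) -> float:
    # Bounded and task-local: utility saturates at 1.0 once every test
    # passes, so resources beyond the coding task buy no additional
    # utility and the argmax never favors world-scale acquisition.
    return min(tests_passed, total_tests) / max(total_tests, 1)
```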

Why It Matters

Challenges the inevitability of AI doom, opening practical design paths for safe, bounded strong AIs.