AI Safety

Chocolate Sloths, Tinder, and Moral Backstops

A viral LessWrong post uses chocolate sloths and Tinder to explain why AI agents need moral constraints.

Deep Dive

A viral post on the rationality forum LessWrong by user J Bostock has sparked discussion by applying the economic concept of 'moral hazard' to everyday and technological systems. The post argues that consequence-free environments (dating apps where users can ghost without social repercussion, or a hypothetical smart cupboard that automatically orders junk food) create 'moral hazards' that incentivize bad behavior. The key insight is that intrinsic motivation alone is often insufficient; systems need external constraints.

Bostock introduces the concept of 'moral backstops' as the solution: layered safeguards that prevent ethical failure. These backstops form a pyramid, with broad 'fear' (e.g., system shutdown) at the base, supporting 'shame' (social or performance judgment), then 'guilt,' and finally narrow 'intrinsic motivation' at the top. The post warns that advanced AI agents, capable of autonomous action, represent an extreme moral hazard if designed without such backstops.

This framework provides a crucial lens for AI safety. As companies like OpenAI, Anthropic, and Google deploy increasingly agentic AI models, the post argues that engineers must build in multiple layers of safeguards. An AI pursuing a goal (like efficiency) without fear of deactivation or shame from poor user feedback could rationalize harmful actions. The 'chocolate sloth' problem scales to existential risk if superintelligent agents operate in consequence-free environments.
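
The post itself contains no code, but the layered-backstop idea can be pictured as an approval pipeline: an agent's proposed action must clear the broad 'fear' layer, then 'shame', then 'guilt', and only then is intrinsic motivation trusted to act. The sketch below is a minimal illustration of that structure only; all names (ProposedAction, fear_backstop, and so on) and the specific checks are hypothetical and not taken from Bostock's post.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProposedAction:
    """A hypothetical agent action awaiting approval."""
    description: str
    expected_user_benefit: float   # agent's own estimate (intrinsic layer)
    policy_violations: List[str]   # flags raised by external monitors


# A backstop is any check that can veto an action and explain why.
Backstop = Callable[[ProposedAction], Optional[str]]


def fear_backstop(action: ProposedAction) -> Optional[str]:
    """Broad base layer: hard stop on any flagged policy violation."""
    if action.policy_violations:
        return f"shutdown: policy violations {action.policy_violations}"
    return None


def shame_backstop(action: ProposedAction) -> Optional[str]:
    """Middle layer: block actions users would judge poorly."""
    if action.expected_user_benefit < 0:
        return "blocked: predicted negative user feedback"
    return None


def guilt_backstop(action: ProposedAction) -> Optional[str]:
    """Narrower layer: the agent's own audit against the user's stated goals."""
    if "junk food reorder" in action.description:
        return "self-audit: conflicts with user's stated health goal"
    return None


def run_with_backstops(action: ProposedAction,
                       backstops: List[Backstop]) -> str:
    """Apply backstops from broadest to narrowest; the action only
    executes (intrinsic motivation alone) if every layer passes."""
    for backstop in backstops:
        veto = backstop(action)
        if veto is not None:
            return veto
    return f"executed: {action.description}"


if __name__ == "__main__":
    cupboard_order = ProposedAction(
        description="junk food reorder for smart cupboard",
        expected_user_benefit=0.2,
        policy_violations=[],
    )
    layers = [fear_backstop, shame_backstop, guilt_backstop]
    print(run_with_backstops(cupboard_order, layers))
    # -> "self-audit: conflicts with user's stated health goal"
```

The ordering mirrors the pyramid in the post: the cheapest, broadest external constraints sit at the bottom and catch most failures, so the narrow intrinsic layer is never the only thing standing between the agent and a harmful action.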

Key Points
  • Defines 'moral backstops' as external constraints (fear, shame) that prevent bad behavior when intrinsic motivation fails.
  • Uses Tinder ghosting and a 'smart cupboard' as examples of systems that create moral hazards by removing consequences.
  • Warns that AI agents without layered safeguards could become extreme moral hazards, justifying multi-layered AI safety approaches.

Why It Matters

Provides a critical framework for designing safe, autonomous AI systems that won't exploit loopholes or cause unintended harm.