AI Safety

Ball+Gravity has a "Downhill" Preference

A thought experiment on LessWrong uses a rolling ball to ask whether an AI system's preferences and its capabilities are genuinely distinct.

Deep Dive

In a post on the rationality and AI safety forum LessWrong, researcher TristanTrim published a thought experiment titled "Ball+Gravity has a 'Downhill' Preference." The piece revisits a classic example from Alex Altair's work on agent foundations, breaking a ball rolling down a hill into three components: the ball, the hill, and gravity. Through the lens of OIS (Ontology of Intelligent Systems) theory, TristanTrim argues that the combined 'ball+gravity' subsystem exhibits a technical 'preference' for rolling downhill, a preference distinct from the final outcome (the ball resting at the bottom), which requires all three components. The core goal is to refine our ontological understanding of how preferences emerge in intelligent systems.
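
To make the decomposition concrete, here is a minimal, hypothetical Python sketch (my illustration, not code from the post). The hill shape h(x), the damping term, and the function names are assumptions chosen for readability: downhill_preference reads off what 'ball+gravity' prefers at any point, while the simulated outcome, the ball settling at the valley floor, only emerges once the hill's shape is included.

```python
import numpy as np

# Hypothetical illustration (not from the post): a 1D ball on a hill h(x)
# under gravity g. The 'preference' of ball+gravity is the downhill direction
# at each point; the realized outcome (the ball settling at the bottom) needs
# the hill as well, since the hill's shape determines where downhill leads.

g = 9.81          # gravitational acceleration (m/s^2)
damping = 0.5     # friction-like term so the ball settles instead of oscillating forever
dt = 0.01         # integration step (s)

def h(x):
    """Hill height: a simple valley with its bottom at x = 0 (assumed shape)."""
    return x ** 2

def slope(x, eps=1e-6):
    """Numerical derivative dh/dx."""
    return (h(x + eps) - h(x - eps)) / (2 * eps)

def downhill_preference(x):
    """The 'preference' of ball+gravity: which way is downhill here?"""
    return -np.sign(slope(x))

# Simulate the full ball+hill+gravity system (shallow-slope approximation).
x, v = 2.0, 0.0   # start the ball partway up the hill, at rest
for _ in range(5000):
    a = -g * slope(x) - damping * v   # acceleration from gravity along the hill, plus damping
    v += a * dt
    x += v * dt

print("preference at start:", downhill_preference(2.0))   # -1.0: downhill points toward x = 0
print("final position:", round(x, 3))                      # ~0.0: the outcome needs the hill too
```

The point of the sketch is only that the 'preference' can be read off locally from ball+gravity, whereas the outcome is a property of the whole three-part system.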

The post's primary contribution is to frame two 'interesting focuses of confusion' for AI alignment research. First, it asks where the line falls between a system's preferences (such as 'wanting' to roll) and its capabilities (the mechanics of rolling). Second, it probes the distinction between symbolic mechanisms (like the shape of a key) and mechanical ones. TristanTrim suggests that in simple systems like 'ball+gravity' the preferences and the capabilities may be one and the same object, but that the separation becomes critical, and far harder to draw, in heavily symbolic systems such as deep neural networks and large language models. The argument connects to practical challenges in AI safety, notably inverse reinforcement learning, where discerning an AI's true goals from its observed behavior is paramount.
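
The inverse-reinforcement-learning connection can be made concrete with a second hypothetical sketch (again my illustration, not from the post, and far simpler than real IRL): observe only the ball's behavior as logged trajectories and try to recover its apparent 'goal' without access to the hill or to gravity. The trajectory generator and the tail-averaging heuristic below are assumptions standing in for logged behavior and for reward inference.

```python
import numpy as np

# Toy stand-in for inverse reinforcement learning (illustrative only): infer a
# system's 'goal' purely from its observed behavior.

rng = np.random.default_rng(0)

def observed_trajectory(start, goal=0.0, steps=200, noise=0.05):
    """Stand-in for logged behavior: a noisy, damped drift toward an unknown goal."""
    xs, x = [], start
    for _ in range(steps):
        x += 0.1 * (goal - x) + noise * rng.normal()
        xs.append(x)
    return np.array(xs)

# Collect behavior from several starting points.
trajectories = [observed_trajectory(s) for s in (-3.0, 1.5, 4.0)]

# Crude 'goal inference': the state the behavior converges to, estimated from
# the tail of each trajectory. Real IRL fits a reward function instead, but the
# question is the same: what preference best explains this behavior?
inferred_goal = np.mean([traj[-20:].mean() for traj in trajectories])
print("inferred goal:", round(float(inferred_goal), 2))   # close to 0.0, the point the behavior converges to
```

The sketch illustrates the framing rather than the method: everything the observer sees is behavior, and the 'preference' is something that must be reconstructed from it.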

Key Points
  • Uses OIS (Ontology of Intelligent Systems) theory to argue that the 'ball+gravity' subsystem of a ball+hill+gravity setup has a distinct 'preference' for rolling downhill.
  • Proposes that separating 'preferences' from 'capabilities' is a key, unsolved problem for understanding deep neural networks.
  • Connects the philosophical thought experiment to practical AI safety challenges like inverse reinforcement learning and value alignment.

Why It Matters

Clarifying how AI systems form preferences is foundational for ensuring they remain aligned with human values as they grow more capable.