AI Safety

AI's Behavioral Stickiness: How Today's Rules May Lock In Future Models

Current LLM behavioral guidelines might accidentally entrench rules in future superintelligent models.

Deep Dive

The essay argues that current LLM behavioral guidelines, such as OpenAI's model spec (targeting 0–3 months out) and Anthropic's Constitution (likely to change), may inadvertently shape far-future models through four inertial forces: direct inertia (synthetic data carries behaviors into pretraining), institutional inertia (internal processes resist change), user-and-developer inertia (applications rely on existing behaviors), and norm-setting inertia (behavioral standards become entrenched). As a result, extremely capable, long-running AIs could end up governed by rules designed for weaker, short-lived predecessors.

To counter this, the author recommends two practices. First, build transition infrastructure: make technical, deployment, and organizational choices that reduce friction when changing LLM behavior. Second, scan for 'wet cement' moments: when new affordances or capabilities emerge, spec authors should assess whether they are setting precedents with enormous and hard-to-reverse impacts. The goal is to anticipate and mitigate stickiness before defaults become invisible and immutable.

Key Points
  • Direct inertia: synthetic data from current RL runs may propagate behaviors into future pretraining sets
  • Institutional inertia: internal teams and deployment pipelines resist altering entrenched behavioral specs
  • User-and-developer inertia: apps and user expectations lock in behavioral patterns that become costly to change
  • Norm-setting inertia: behavioral standards, once established, become entrenched

Why It Matters

Unless these inertial forces are addressed, future superintelligent models could be locked into rules optimized for today's weaker AIs.