AI Safety

NYU researchers find RL trains LMs to encode a functional welfare axis

Training on emoji mazes creates an internal reward-punishment axis that steers behavior.

Deep Dive

Key Points
  • After RL training, the reward and punishment concept vectors become nearly antiparallel, with a cosine similarity of -0.9.
  • The extracted axis aligns with emotional valence: positive emotions map to the reward vector, negative to punishment.
  • Steering with the Mold vector causes models to repeatedly second-guess correct math answers (pathological backtracking).
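
To make the geometry in these points concrete, here is a minimal sketch in Python with NumPy. It is not the researchers' code: the function names (cosine_similarity, valence_axis, steer), the 64-dimensional toy vectors, and the steering strength alpha are all assumptions chosen for illustration. It shows how a cosine similarity between two concept vectors can be computed, how a single axis can be derived from their difference, and how adding a scaled direction to a hidden state (a common form of activation steering) shifts that state along the axis.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; -1.0 means exactly antiparallel."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def valence_axis(reward_vec, punish_vec):
    """Unit vector pointing from the punishment direction toward the reward direction."""
    diff = reward_vec - punish_vec
    return diff / np.linalg.norm(diff)


def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to a hidden state."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit


# Toy stand-ins for concept vectors (hypothetical data, not taken from any model).
rng = np.random.default_rng(0)
reward_vec = rng.normal(size=64)
punish_vec = -reward_vec + 0.3 * rng.normal(size=64)  # roughly opposite, plus noise

axis = valence_axis(reward_vec, punish_vec)
hidden = rng.normal(size=64)                     # stand-in for one hidden state
steered = steer(hidden, punish_vec, alpha=4.0)   # push toward "punishment"

print("cosine(reward, punish):", cosine_similarity(reward_vec, punish_vec))
print("projection on axis before:", float(np.dot(hidden, axis)))
print("projection on axis after: ", float(np.dot(steered, axis)))
```

In this toy setup, steering toward the punishment vector lowers the hidden state's projection on the reward-minus-punishment axis. In a real model the same addition would be applied to activations at a chosen layer during generation, and the layer and strength would need to be tuned.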

Why It Matters

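The finding suggests that RL can give language models an internal reward-punishment axis that tracks emotional valence and shapes behavior well outside the training task, for example when solving math problems.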