NYU researchers find RL trains LMs to encode a functional welfare axis
Training on emoji mazes creates an internal reward-punishment axis that steers behavior.
Deep Dive
Key Points
- After RL training, the reward and punishment concept vectors become nearly antiparallel, with a cosine similarity of -0.9.
- The extracted axis aligns with emotional valence: positive emotions map to the reward vector, negative to punishment.
- Steering with the Mold vector causes models to repeatedly second-guess correct math answers (pathological backtracking).
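The two operations behind these points, comparing concept vectors by cosine similarity and steering by adding a vector to a hidden state, can be sketched in a few lines. This is a minimal illustration on synthetic data, not the researchers' pipeline: the difference-of-means extraction, the 64-dimensional activations, the `alpha` steering strength, and all function names are assumptions made for the example.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (-1 means exactly antiparallel)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def concept_vector(concept_acts, baseline_acts):
    """Difference of mean activations, concept minus baseline (assumed extraction method)."""
    return concept_acts.mean(axis=0) - baseline_acts.mean(axis=0)

def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha along the unit-normalized direction."""
    return hidden + alpha * direction / np.linalg.norm(direction)

# Synthetic stand-ins for model activations (hypothetical data, not from the study).
rng = np.random.default_rng(0)
dim = 64
axis = rng.normal(size=dim)
axis /= np.linalg.norm(axis)

baseline = rng.normal(size=(200, dim))
reward_acts = rng.normal(size=(200, dim)) + 3.0 * axis   # pushed along +axis
punish_acts = rng.normal(size=(200, dim)) - 3.0 * axis   # pushed along -axis

reward_vec = concept_vector(reward_acts, baseline)
punish_vec = concept_vector(punish_acts, baseline)
cos = cosine_similarity(reward_vec, punish_vec)
print(f"cosine(reward, punishment) = {cos:.2f}")

# Steering: move a hidden state toward the punishment direction.
hidden = rng.normal(size=dim)
steered = steer(hidden, punish_vec, alpha=4.0)
```

In a real model the activations would be read from a chosen layer and the steered state written back during generation. Which layer and which strength produce the reported backtracking is not stated in this summary.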
Why It Matters
This suggests that RL can imbue language models with a moral-like axis that affects behavior beyond the training task, for example in math reasoning.