AI Safety

LessWrong AI May 31, 2026

⚡ Training on emoji mazes creates an internal reward-punishment axis that steers behavior.

Deep Dive

Key Points

Post-training, reward and punishment concept vectors become antiparallel with cosine similarity of -0.9.
The extracted axis aligns with emotional valence: positive emotions map to the reward vector, negative to punishment.
Steering with the Mold vector causes models to repeatedly second-guess correct math answers (pathological backtracking).
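
To make the terms in the key points concrete, here is a minimal sketch of what "cosine similarity near -1" and "steering with a vector" mean. It is not the post's code: the toy 3-dimensional vectors, the function names, and the choice to define the axis as the normalized difference of the reward and punishment vectors are all assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (-1 means antiparallel)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def valence_axis(reward_vec, punish_vec):
    """Unit vector pointing from the punishment direction toward the reward direction.

    Assumption: the axis is taken as the normalized difference of the two
    concept vectors; the post may extract it differently.
    """
    diff = [r - p for r, p in zip(reward_vec, punish_vec)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def steer(hidden, direction, alpha):
    """Activation steering: add alpha * direction to a hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy vectors, invented for illustration (not data from the post).
reward = [1.0, 0.2, 0.0]
punish = [-0.9, 0.1, 0.1]

similarity = cosine_similarity(reward, punish)  # strongly negative for these toy values
axis = valence_axis(reward, punish)

# A negative alpha pushes the hidden state toward the "punishment" end of the axis.
steered = steer([0.0, 0.0, 0.0], axis, alpha=-4.0)
```

In a real model the vectors would be residual-stream activations with thousands of dimensions, and the steering step would be applied inside a forward pass at a chosen layer; the arithmetic is the same.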

These results suggest that reinforcement learning (RL) can instill in language models a moral-like axis that affects behavior beyond the training task.