Neuralese Boosts AI Alignment via Verifiable Reward Training
RLVR with exact grading makes chain-of-thought robust against reward hacking.
Reinforcement Learning with Verifiable Rewards (RLVR) trains chain-of-thought reasoning on exactly graded problems like coding and formal math, where automated grading is error-free and hard to cheat. For subjective tasks like humor, inexactly graded problems require a reward model trained on human labels, but this opens the door to adversarial exploits like prompt-injecting the grader. The article notes that exactly graded problems can bootstrap capabilities beyond human skill, while inexactly graded problems make optimizing chain-of-thought difficult due to grading subjectivity.
- RLVR uses verifiable rewards (e.g., unit tests) to train chain-of-thought reasoning without human labels.
- Inexact grading requires reward models vulnerable to adversarial exploitation and sparse supervision.
- Neuralese allows direct optimization of internal reasoning, making alignment training more robust and scalable.
Why It Matters
Verifiable reward training with neuralese reduces alignment risks from reward model gaming.