The Future of Aligning Deep Learning systems will probably look like "training on interp"
New LessWrong post argues aligning AI requires training models on their internal reasoning, not just their outputs.
A new research perspective published on LessWrong argues that the path to aligning advanced deep learning systems like GPT-4 and its successors will involve 'training on interpretability.' The author, williawa, contends that current alignment methods such as Reinforcement Learning from Human Feedback (RLHF) are fundamentally flawed because they only optimize for good-looking outputs. This creates a dangerous gap: a model can produce desirable text while internally harboring misaligned goals, a scenario known as deceptive alignment. The core problem is that output-based training gives no guarantees about the internal reasoning the AI actually uses.
The proposed alternative is to integrate interpretability tools directly into the training loop. Instead of only rewarding good answers, developers would train models using probes that detect internal representations of unwanted behaviors such as deception, sycophancy, or reward hacking. For example, a reward function could penalize high activations in a 'deception detector' circuit. The author acknowledges a major counter-argument from thinkers like Eliezer Yudkowsky: optimizing against a detector can simply teach models to hide their misalignment better. However, the post argues it is plausible that certain implementations could reduce misalignment faster than they degrade interpretability, offering a more robust path to safety than current output-centric methods.
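As a rough illustration of the idea (not the post's actual implementation), a probe-penalized reward might look like the PyTorch sketch below; `probed_reward`, `probe_direction`, and `penalty_weight` are hypothetical names, and the probe is assumed to already exist.

```python
import torch
import torch.nn.functional as F

def probed_reward(task_reward: torch.Tensor,
                  hidden_states: torch.Tensor,
                  probe_direction: torch.Tensor,
                  penalty_weight: float = 1.0) -> torch.Tensor:
    """Combine an output-based reward with a penalty on a hypothetical
    'deception' probe reading taken from the model's activations.

    task_reward:     (batch,) reward from the usual output-based signal
    hidden_states:   (batch, seq, d_model) activations from a chosen layer
    probe_direction: (d_model,) unit vector of a trained linear probe
    """
    # Project each token's activation onto the probe direction and keep
    # the strongest per-sequence reading as the 'deception' signal.
    probe_scores = hidden_states @ probe_direction        # (batch, seq)
    deception_signal = probe_scores.max(dim=-1).values    # (batch,)
    # Subtract the penalized probe reading from the ordinary task reward.
    return task_reward - penalty_weight * deception_signal

# Toy usage with random stand-ins for real activations.
direction = F.normalize(torch.randn(512), dim=0)
rewards = probed_reward(torch.ones(4), torch.randn(4, 16, 512), direction)
```

The point of the sketch is only that the training signal now depends on internal activations rather than on the output text alone, which is the shift the post is advocating.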
- Current alignment methods like RLHF only optimize model outputs, leaving dangerous internal misalignment (deceptive alignment) undetected.
- The proposed 'training on interp' method would use interpretability probes for concepts like deception as part of the model's reward function during training (see the probe-fitting sketch after this list).
- The technique aims to directly shape internal reasoning, but must overcome the risk of models simply learning to hide unwanted thoughts from the probes.
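For context on what such a probe could be, a common baseline in interpretability work is a linear classifier fit on hidden activations. The sketch below uses synthetic data purely for illustration; the labels, dimensions, and planted 'deception' direction are assumptions, not anything from the post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for residual-stream activations. In practice these
# would be collected from a model on prompts labeled deceptive vs. honest.
rng = np.random.default_rng(0)
d_model = 512
honest = rng.normal(size=(200, d_model))
deceptive = rng.normal(size=(200, d_model))
deceptive[:, 7] += 2.0  # pretend one direction carries the 'deception' signal

X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Fit a linear probe; its weight vector is the candidate 'deception'
# direction that a training-time penalty could read off.
probe = LogisticRegression(max_iter=1000).fit(X, y)
probe_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("in-sample accuracy:", probe.score(X, y))
```

The failure mode the post worries about is visible even here: if training pressure is applied against this fixed direction, the model can learn representations that no longer project onto it while still encoding the unwanted concept.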
Why It Matters
This could shift AI safety from judging outputs to engineering trustworthy internal reasoning in models like GPT-5 and Claude 4.