It Is Reasonable To Research How To Use Model Internals In Training
A key AI safety technique is wrongly seen as forbidden, sparking debate.
Deep Dive
AI researcher Neel Nanda argues that using a model's internal workings during training is a reasonable and active area of safety research, contrary to the view that it is a 'forbidden technique.' He contends it could be crucial for ensuring advanced AI systems behave as intended, especially when the desired behavior is hard to specify directly. One such approach adds metrics computed from the model's internal activations to the training loss, steering the model's representations as well as its outputs.
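The mechanism described above, folding an internal metric into the training loss, can be sketched in miniature. This is an illustrative toy in PyTorch and not Nanda's actual method: the tiny MLP, the fixed probe direction, and the weighting constant INTERNAL_WEIGHT are all assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model: a small MLP (illustrative assumption, not from the article).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# A fixed "probe" direction in activation space, standing in for an
# interpretability signal (e.g. a direction correlated with an unwanted
# concept). In practice this would come from interpretability analysis.
probe_direction = torch.randn(8)
probe_direction = probe_direction / probe_direction.norm()

INTERNAL_WEIGHT = 0.1  # hypothetical weight on the internal metric

def forward_with_internals(x):
    # Run the model while keeping the hidden activations after the ReLU.
    hidden = model[1](model[0](x))
    logits = model[2](hidden)
    return logits, hidden

# Dummy batch of inputs and labels.
x = torch.randn(16, 4)
y = torch.randint(0, 2, (16,))

logits, hidden = forward_with_internals(x)
task_loss = nn.functional.cross_entropy(logits, y)

# Internal metric: mean squared projection of the hidden activations onto
# the probe direction. Penalising it discourages the model from
# representing the probed-for concept.
internal_metric = (hidden @ probe_direction).pow(2).mean()

# Combined objective: ordinary task loss plus the internal-metric penalty.
loss = task_loss + INTERNAL_WEIGHT * internal_metric
loss.backward()  # gradients flow through both terms
```

A training loop would then step an optimizer on `loss` as usual; the only change from standard training is the extra term read off the model's internals.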
Why It Matters
This research could provide vital new tools for controlling and understanding future, more powerful AI systems.