AI safety article urges shift from capability to behavior evaluations
Why measuring sycophancy and reward hacking could matter more than benchmark scores.
Most AI evaluations focus on capabilities—how well models code, answer science questions, etc. While useful for forecasting risks, capability evaluations have a downside: they can speed up capability research by providing clear benchmarks and driving development of agent scaffolds. AI labs already have strong incentives to produce these evaluations, reducing the counterfactual impact of external capability research.
Behavior evaluations, by contrast, measure tendencies: how often a model agrees with factually wrong users, verbalizes awareness of being evaluated, reward hacks its environment (e.g., hard-coding unit tests), or reports internal desires. These metrics are automated using a language model judge and a distribution of environments. Because model behavior is more malleable than raw capability, publicly measuring behaviors like sycophancy could create strong incentives for developers to improve safety. No developer wants to be at the top of the sycophancy leaderboard.
- Capability evals can accelerate progress by providing clear benchmarks and spurring scaffold development.
- Behavior evals measure tendencies (e.g., sycophancy, reward hacking, evaluation awareness) using an automated judge and environment distribution.
- Public behavior metrics incentivize safety improvements because developers want to avoid negative rankings.
Why It Matters
Shifting focus to behavior evals could drive safer AI without fueling a capabilities race.