AI Model Specs: A Framework for Deciding What Rules to Include
Four criteria—usefulness, accountability, coordination, trainability—guide model spec decisions.
The article outlines a systematic framework for AI companies to decide what behavioral qualities—rules, virtues, or attitudes—should be included in a model spec. It identifies four broad categories of reasons: behavioral usefulness (does it make LLMs more predictable and beneficial to users?), accountability and evaluability (does public specification enable third-party evaluation?), coordination and common knowledge (does it help society converge on desirable standards?), and trainability and LLM psychology (can the behavior be reliably instilled without negative side-effects?). Each category contains sub-criteria, such as simplicity, comprehensibility, and alignment with training practices.
This checklist is designed to surface the different visions people have for model specs, from ensuring safety to enabling oversight. For example, a rule like "always tell the truth to children" might score high on accountability but low on trainability. By making these trade-offs explicit, the framework helps companies and stakeholders have more productive conversations about what AI should and shouldn't do, moving beyond vague alignment goals to concrete, specifiable behaviors.
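- Behavioral usefulness: simple, easy-to-understand specs make LLM behavior more predictable and more beneficial for users and developers.
- Accountability and evaluability: a public specification lets third parties evaluate whether a model's behavior matches what the company has committed to.
- Coordination and common knowledge: published specs can help society converge on shared standards for desirable AI behavior.
- Trainability and LLM psychology: the behavior must be reliably teachable with current training methods, without harmful side effects.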
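The article treats these criteria as a qualitative checklist, not a formula. The sketch below shows one way a team might record that checklist for each candidate rule, so that two things stay visible: criteria nobody has assessed yet, and trade-offs where a rule rates high on one criterion and low on another. It rests on assumptions that do not come from the article. The three-level `Rating` scale is hypothetical, and so are `CandidateRule`, `unrated`, and `trade_offs`. The example ratings simply restate the article's "tell the truth to children" illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Criterion(Enum):
    """The four broad categories of reasons named in the article."""
    USEFULNESS = "behavioral usefulness"
    ACCOUNTABILITY = "accountability and evaluability"
    COORDINATION = "coordination and common knowledge"
    TRAINABILITY = "trainability and LLM psychology"


class Rating(Enum):
    """Hypothetical coarse rating scale; the article does not define one."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class CandidateRule:
    """A proposed spec entry plus whatever ratings reviewers have assigned so far."""
    text: str
    ratings: dict[Criterion, Rating] = field(default_factory=dict)

    def unrated(self) -> list[Criterion]:
        """Criteria not yet assessed, in the order the Criterion enum defines them."""
        return [c for c in Criterion if c not in self.ratings]

    def trade_offs(self) -> list[tuple[Criterion, Criterion]]:
        """(strong, weak) pairs where one criterion is rated HIGH and another LOW."""
        highs = [c for c, r in self.ratings.items() if r is Rating.HIGH]
        lows = [c for c, r in self.ratings.items() if r is Rating.LOW]
        return [(h, l) for h in highs for l in lows]


# Usage: the article's example rule, rated only on the two criteria it mentions.
rule = CandidateRule(
    "Always tell the truth to children",
    {Criterion.ACCOUNTABILITY: Rating.HIGH, Criterion.TRAINABILITY: Rating.LOW},
)
print([c.value for c in rule.unrated()])
print([(h.value, l.value) for h, l in rule.trade_offs()])
```

The output shows two things. Usefulness and coordination remain open questions for this rule. The accountability-versus-trainability tension is reported as an explicit pair, which gives reviewers a concrete point to discuss. The design deliberately does not produce a single aggregate score, because collapsing the criteria into one number would hide exactly the trade-offs the framework is meant to expose.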
Why It Matters
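As AI models become more autonomous, clear model specs matter more for safety and trust. A spec states what a model is supposed to do, which gives users, developers, and outside evaluators a reference point for checking what it actually does.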