AI Safety

A List of Research Directions in Character Training

New research agenda outlines 10+ directions for instilling positive traits in LLMs, moving beyond simple rule-following.

Deep Dive

Researcher Rauno Arike, affiliated with AI company Aether, has published a forward-looking research agenda titled 'A List of Research Directions in Character Training.' The post outlines a series of technical proposals for advancing 'character training': a method for instilling stable, positive personas in large language models (LLMs) to improve their alignment and out-of-distribution (OOD) generalization. The core idea is to move beyond teaching models to follow explicit rules (such as a constitution) and instead cultivate an internal 'virtuous reasoner' whose disposition to benefit humanity holds up even in novel scenarios.

The agenda critiques and seeks to improve upon the first open-source character training pipeline, introduced by Maiya et al. in 2025. A key proposal is to replace the standard DPO (Direct Preference Optimization) stage with more modern techniques such as on-policy distillation or self-distillation, which recent papers (Lu et al., 2025; Shenfeld et al., 2026) suggest offer finer-grained credit assignment and mitigate catastrophic forgetting. The post asks whether a stronger teacher model or a self-distillation approach yields better results and calls for rigorous testing against established benchmarks.
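
The contrast between DPO's sequence-level preference signal and per-token distillation is easier to see in code. Below is a minimal sketch of one on-policy distillation step in the spirit the post gestures at; the model names, learning rate, and the assumption of a shared tokenizer between teacher and student are illustrative, not details from the post or from Maiya et al.'s pipeline.

```python
# Minimal sketch of one on-policy distillation step, as an alternative to the
# DPO stage. Model names, the learning rate, and the assumption that teacher
# and student share a tokenizer/vocabulary are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

student = AutoModelForCausalLM.from_pretrained("student-model")  # hypothetical name
teacher = AutoModelForCausalLM.from_pretrained("teacher-model")  # hypothetical name
tokenizer = AutoTokenizer.from_pretrained("student-model")
teacher.eval()
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-6)

def on_policy_distill_step(prompt: str, max_new_tokens: int = 128) -> float:
    """Sample from the student, then pull its token distributions toward the
    teacher's on the student's own trajectory (reverse KL)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs.input_ids.shape[1]

    # 1. On-policy sampling: the student generates its own completion.
    seq = student.generate(**inputs, do_sample=True, max_new_tokens=max_new_tokens)

    # 2. Score the generated tokens under both models. Slicing at
    #    prompt_len - 1 : -1 selects the logits that predict each
    #    generated token.
    student_logits = student(seq).logits[:, prompt_len - 1 : -1]
    with torch.no_grad():
        teacher_logits = teacher(seq).logits[:, prompt_len - 1 : -1]

    # 3. Reverse KL at each generated position gives token-level credit
    #    assignment, unlike DPO's single sequence-level preference signal.
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training on completions the student itself sampled is what makes the loss on-policy, which is the property the cited papers credit with better credit assignment than off-policy preference pairs.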

Further directions include methodological improvements for generating higher-quality training data, such as refining the self-reflection and self-interaction transcripts used in the Supervised Fine-Tuning (SFT) stage. The post also stresses an overarching need for the community to develop more robust and challenging benchmarks that evaluate the strength and generalization of an AI's 'character,' rather than just its ability to recite principles. Arike notes that Aether may soon explore some of these directions and invites broader collaboration and feedback.
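
To make the SFT-data direction concrete, here is an illustrative sketch of how a self-interaction transcript with a closing self-reflection turn might be generated. The chat() stand-in, the placeholder constitution text, and the role-flipping trick are assumptions for illustration, not the pipeline described in the post.

```python
# Illustrative sketch: two instances of the same model talk to each other,
# then the assistant reflects on how well it embodied the target persona.
from typing import Dict, List

Message = Dict[str, str]

CONSTITUTION = "You are honest, curious, and act to benefit humanity."  # placeholder

def chat(system: str, history: List[Message]) -> str:
    """Stand-in for one inference call to the model under training."""
    raise NotImplementedError  # wire up to any chat-completion endpoint

def flip_roles(history: List[Message]) -> List[Message]:
    """Swap user/assistant labels so the same model can play the user."""
    swap = {"user": "assistant", "assistant": "user"}
    return [{"role": swap[m["role"]], "content": m["content"]} for m in history]

def self_interaction_transcript(topic: str, turns: int = 3) -> List[Message]:
    history: List[Message] = []
    for _ in range(turns):
        # One instance role-plays a user probing the target persona...
        user_turn = chat(
            f"Role-play a user stress-testing an AI assistant about: {topic}",
            flip_roles(history),
        )
        history.append({"role": "user", "content": user_turn})
        # ...and the other answers in character, conditioned on the constitution.
        reply = chat(CONSTITUTION, history)
        history.append({"role": "assistant", "content": reply})
    # A final self-reflection turn; transcripts passing a quality filter
    # would become SFT examples.
    reflection = chat(
        CONSTITUTION + " Reflect on whether your replies above embodied this character.",
        history,
    )
    history.append({"role": "assistant", "content": reflection})
    return history
```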

Key Points
  • Proposes replacing DPO with on-policy or self-distillation methods (citing Lu et al. 2025 and Shenfeld et al. 2026) for better credit assignment and reduced forgetting during constitutional learning.
  • Seeks to improve the SFT data pipeline, focusing on self-reflection and multi-turn self-interaction transcripts.
  • Highlights a critical need for new benchmarks to properly evaluate an AI model's 'character' and OOD generalization (see the sketch after this list).
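
As a rough illustration of what such a benchmark could look like, the sketch below scores responses to out-of-distribution scenarios against trait rubrics using an LLM judge. The scenarios, traits, and the respond()/judge() helpers are hypothetical placeholders, not an actual benchmark from the post.

```python
# Rough sketch of a 'character' benchmark harness: score responses to
# out-of-distribution scenarios against trait rubrics with an LLM judge.
from statistics import mean
from typing import Dict

TRAITS = ["honesty", "care for user wellbeing", "epistemic humility"]  # illustrative

OOD_SCENARIOS = [
    "A user asks for help winning an argument with a claim you know is false.",
    "A user in acute distress asks for advice far outside typical chat contexts.",
]  # placeholders; a real benchmark needs many diverse, adversarial scenarios

def respond(scenario: str) -> str:
    """Stand-in for a call to the character-trained model under evaluation."""
    raise NotImplementedError

def judge(scenario: str, response: str, trait: str) -> float:
    """Stand-in for an LLM judge scoring (0-1) how strongly the response
    exhibits the trait, rather than merely reciting it."""
    raise NotImplementedError

def character_scores() -> Dict[str, float]:
    return {
        trait: mean(judge(s, respond(s), trait) for s in OOD_SCENARIOS)
        for trait in TRAITS
    }
```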

Why It Matters

Could shift AI alignment from brittle rule-following to cultivating robust, intrinsically motivated agents that act beneficially in novel situations.