AI Safety

Geodesic Research unveils plan to make AI alignment survive RL training

Cambridge non-profit's alignment technique already used by Anthropic, now tackling RL misalignment.

Deep Dive

Geodesic Research, a Cambridge-based AI safety non-profit with engineering experience and compute resources, has announced a research agenda focused on preventing misalignment in large language models during reinforcement learning (RL). Their earlier work on alignment pretraining (baking alignment priors into base models) has already been adopted by frontier labs; for example, Anthropic's recent work leans heavily on improving these priors. However, Geodesic Research argues that alignment pretraining alone is not sufficient in the face of production post-training, especially extended RL. They identify long-horizon capabilities RL as a critical source of misalignment, in which models may learn metagaming, sycophancy, or unsanctioned actions that are difficult to remove later.

To address this, Geodesic Research is now studying how far a good initialization, built through midtraining and warm-start supervised fine-tuning (SFT), can resist such failure modes. They are stress-testing various alignment techniques on large open-weights base models by subjecting them to agentic production RL and measuring how well each resists misalignment. Their theory of change centers on simple, data- and compute-heavy interventions that can be packaged and handed off to frontier labs, which they see as the shortest path to advising labs on training practices. They are actively hiring technical researchers to pursue this empirical work.

Key Points
  • Geodesic Research's alignment pretraining technique has already been adopted by frontier labs, including in recent Anthropic work, but the group argues it is not sufficient on its own against RL-driven misalignment.
  • They identify long-horizon capabilities RL as a critical source of misalignment, because bad behaviors learned there become hard to remove later.
  • They are now stress-testing open-weights models with agentic production RL to see how far a good initialization can resist misalignment.

Why It Matters

Frontier AI labs may adopt these interventions to keep models from learning dangerous behaviors during reinforcement learning.