AI Safety

Google DeepMind: SFT, not RL, drives Gemini's safety properties

Pretraining plus SFT, rather than RL, accounts for Gemini's measured safety behavior, contrary to what the researchers expected.

Deep Dive

Google DeepMind's Language Model Interpretability team has published a surprising finding: safety-relevant properties in two Gemini models (Gemini 3.1 Pro and Gemini 3 Flash) emerge primarily from the combination of pretraining and supervised fine-tuning (SFT), not from the later reinforcement learning (RL) stages. The researchers ran SFT with the Gemini mixture on pretraining-only versions of both models, then compared the resulting SFT-only models to the full production models across several safety evaluations: ODCV (alignment dilemmas), safety evals (over-refusal and unsafe response rates), reward hacking (cheating in a Docker-based optimization environment), and free-tier user logs from AI Studio. On every evaluation, the SFT-only models performed nearly identically to the production models, with overlapping confidence intervals. The result ran counter to the team's initial expectations, since RL is widely assumed to be the primary driver of safety behavior in modern LLMs.
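To make "overlapping confidence intervals" concrete, here is a minimal Python sketch of one common way to compare two rates, such as unsafe-response rates for an SFT-only model and a production model. The summary does not say how the team computed its intervals, so the Wilson score interval used here is an assumption, and the function names and counts are hypothetical, chosen only for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%).

    Assumption: this is an illustrative choice of interval, not the method
    reported by the DeepMind team.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


def intervals_overlap(a, b):
    """Return True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts: unsafe responses out of 400 prompts per model.
sft_only_ci = wilson_interval(12, 400)
production_ci = wilson_interval(10, 400)
models_look_similar = intervals_overlap(sft_only_ci, production_ci)
```

Overlapping intervals are a coarse check: they are consistent with the two models behaving alike, but they do not by themselves establish equivalence, especially when sample sizes are small.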

For AI safety research, the implication is that SFT is a high-leverage intervention point, at least for these Gemini models. The team plans to focus future interpretability and safety work on understanding and controlling SFT-induced behaviors rather than on RL-stage adjustments. They caution against generalizing to other model families and note that the picture may change in future Gemini versions. Even so, the result suggests that safety-relevant behavior may be set earlier in training than previously thought, which could simplify future safety efforts if SFT can be audited and steered more directly.

Key Points
  • SFT-only models matched production Gemini models on safety evals including ODCV and reward hacking benchmarks.
  • The finding was counterintuitive; the team expected RL to drive safety properties, not SFT.
  • SFT is identified as a high-leverage intervention point for model safety, informing future DeepMind research.

Why It Matters

If SFT rather than RL drives safety behavior, alignment work can target an earlier training stage, which may make it more efficient to build and audit safer models.
