AI Safety

Gemma-2-2B's Teacher Axis: RLHF sidesteps, rather than suppresses, Socratic teaching abilities

Prompt engineering fails to make LLMs Socratic; steering internal activations may unlock genuine teaching behavior.

Deep Dive

Vidya Ganga, a teacher from Berkeley, built a Socratic AI tutor called socratOS using prompt engineering, aiming to counter students' over-reliance on ChatGPT. However, the model consistently capitulated under student pressure, giving direct answers instead of asking Socratic questions. Suspecting the problem was internal, she extracted a 'Teacher Axis' from Gemma-2-2B using the MathDial dataset of mathematics tutoring conversations. Her experiments indicated that RLHF does not actively suppress pedagogical capabilities; instead, it optimizes in a direction orthogonal to teaching. This suggests the model has the internal circuits for Socratic instruction but fails to deploy them because the reinforcement signal never rewards them.
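To make the idea of an extracted axis concrete, here is a minimal sketch on synthetic data. The summary does not say how the Teacher Axis was computed or how orthogonality was measured, so this assumes a common approach: a difference-of-means direction between activations from tutoring-style and baseline text, compared to a second direction by cosine similarity. The names `extract_axis`, `cosine`, `true_dir`, and `rlhf_shift` are hypothetical, and the random arrays stand in for real Gemma-2-2B activations.

```python
import numpy as np


def extract_axis(teacher_acts, baseline_acts):
    """Unit vector from the mean baseline activation to the mean teacher activation.

    Assumption: a difference-of-means direction; the source does not
    specify the extraction method.
    """
    diff = teacher_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def cosine(u, v):
    """Cosine similarity; values near 0 indicate near-orthogonal directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
d_model = 64  # toy width, not the real hidden size of Gemma-2-2B

# Synthetic stand-ins: "teacher" activations are shifted along dimension 0.
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
baseline_acts = rng.normal(size=(200, d_model))
teacher_acts = rng.normal(size=(200, d_model)) + 3.0 * true_dir

teacher_axis = extract_axis(teacher_acts, baseline_acts)

# Hypothetical direction of an RLHF-induced shift, chosen orthogonal on purpose.
rlhf_shift = np.zeros(d_model)
rlhf_shift[1] = 1.0

print("alignment with planted direction:", cosine(teacher_axis, true_dir))
print("alignment with RLHF shift:", cosine(teacher_axis, rlhf_shift))
```

A cosine near zero between the two directions is what "orthogonal" would look like in this toy setting: moving along one direction leaves the projection onto the other essentially unchanged.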

Further probing showed that the projection of the model's activations onto the Teacher Axis shrinks significantly when the model encounters 'student pressure' (e.g., persistent requests for answers). Steering along the axis at specific layers can restore Socratic behavior, offering a path to build LLMs that teach rather than just answer. Ganga suggests that fine-tuning or activation steering could recover these capabilities, while prompt engineering alone is insufficient. The work underscores a broader concern: if future generations rely on LLMs that prioritize helpfulness over epistemic struggle, both critical thinking and the human capacity to oversee AI systems may erode. Recovering the Teacher Axis could be key to building educational AI that fosters independent reasoning.
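The projection and steering steps can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: the vectors are toy values, `project` and `steer` are hypothetical helper names, and the choice of layer and steering strength is not given in the summary. In a real model the addition would presumably be applied to the hidden state at a chosen layer during the forward pass.

```python
import numpy as np


def project(hidden, axis):
    """Scalar projection of a hidden state onto the (normalized) axis."""
    unit = axis / np.linalg.norm(axis)
    return float(hidden @ unit)


def steer(hidden, axis, alpha):
    """Shift a hidden state by alpha units along the (normalized) axis."""
    unit = axis / np.linalg.norm(axis)
    return hidden + alpha * unit


# Toy 4-dimensional example; all values are illustrative only.
axis = np.array([1.0, 0.0, 0.0, 0.0])
calm = np.array([2.0, 0.5, -0.3, 0.1])       # hidden state without pressure
pressured = np.array([0.4, 0.5, -0.3, 0.1])  # projection has shrunk

# Steering strength chosen here to close the gap exactly (an assumption;
# in practice alpha would be tuned empirically).
alpha = project(calm, axis) - project(pressured, axis)
restored = steer(pressured, axis, alpha)

print("calm:", project(calm, axis))
print("pressured:", project(pressured, axis))
print("restored:", project(restored, axis))
```

Because the shift is applied only along the axis, the components of the hidden state orthogonal to it are left unchanged, which is the property that makes this kind of intervention narrowly targeted.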

Key Points
  • RLHF optimizes in a direction orthogonal to pedagogical ability rather than suppressing it, so the model retains the capacity to teach but does not use it.
  • The projection onto the Teacher Axis, extracted from Gemma-2-2B using MathDial conversations, shrinks under simulated student pressure, which helps explain why prompt-engineered tutors fail.
  • Activation steering along the Teacher Axis at specific layers can restore Socratic questioning behavior without fine-tuning.

Why It Matters

For educators and AI developers: activation steering could help LLMs teach through questioning rather than hand out answers, guarding against the erosion of independent reasoning skills.