AI Safety

AI safety terms 'scheming' and 'mech interp' have shifted meaning since 2023

The gap between what these terms used to mean and what they mean today could change how you read safety research.

Deep Dive

In a recent LessWrong post, Cleo Nardo explains how the meanings of two key AI safety terms have shifted since around 2023.

Originally, 'scheming' (from Joseph Carlsmith's 2023 report) described an AI that performs well during training specifically so that it can later gain power or pursue out-of-context goals, a form of deceptive alignment. That concept is now commonly labeled 'alignment faking.' Then, in December 2024, Apollo Research redefined 'scheming' to mean an AI that pursues a goal given in-context during testing or deployment, not during training. The old meaning implied deeper, more dangerous misalignment; the new one is a capability test.

Similarly, 'mechanistic interpretability' (mech interp) started as the precise reverse-engineering of model internals into human-readable algorithms (e.g., Neel Nanda's grokking work, 2023; Redwood's IOI paper, 2022). Today, it has broadened to cover any technique that uses weights or activations to understand or predict behavior, a category that was sometimes called 'transparency techniques' at the time. The original ambitious goal is now called 'ambitious mech interp.' These semantic shifts matter because early safety arguments often rely on the older, stricter definitions.

Key Points
  • 'Scheming' originally meant training-gaming for out-of-context goals (Carlsmith, 2023); now it often means in-context goal pursuit during testing or deployment.
  • 'Mech interp' originally required full reverse-engineering of neural algorithms; now it encompasses any internals-based explanation.
  • Old 'scheming' is now called 'alignment faking,' and old 'mech interp' is now 'ambitious mech interp'; understanding these shifts prevents misreading historical papers.

Why It Matters

Terminology drift in AI safety can cause misinterpretation of foundational research, affecting risk assessments and research directions.
