Was It Owl a Dream?
New analysis suggests AI doesn't actually process 'entangled tokens' during controversial subliminal training.
A new analysis published on LessWrong challenges the prevailing theory of how 'subliminal learning' works in AI models. Researcher Yovel Rom applied mechanistic interpretability techniques to test Zur et al.'s 2025 hypothesis that subliminal learning occurs through 'token entanglement', in which specific tokens (like '087' for owls) become correlated with concepts during fine-tuning.
Using Llama 3.2 1B instruction-tuned models, Rom examined how token logits evolve across model layers. Animal token logits (like 'dolphin') showed the expected increases when models were prompted to like those animals, but the supposedly 'entangled' numeric tokens stayed essentially flat across layers. Moreover, all numeric tokens shifted by similar amounts regardless of the specific concept being learned, suggesting the model wasn't processing them as concept-specific signals.
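For readers who want a feel for this kind of probe, below is a minimal logit-lens sketch in the spirit of Rom's layer-wise measurements. It is not the post's actual code: the prompt, the probed tokens (' owl', ' 087', ' 123'), and the use of Hugging Face transformers are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-1B-Instruct"  # model family named in the post
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

# Illustrative prompt: induce a preference, then look at next-token logits.
prompt = "You love owls. My favorite numbers are: 112, 087,"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Logit lens: project each layer's hidden state at the last position
# through the final norm and unembedding matrix to get per-layer logits.
lm_head = model.get_output_embeddings()
final_norm = model.model.norm

for token_str in [" owl", " 087", " 123"]:  # animal token vs. numeric tokens
    tid = tok.encode(token_str, add_special_tokens=False)[0]
    per_layer = []
    for h in out.hidden_states:  # embeddings, then one entry per layer
        logits = lm_head(final_norm(h[:, -1, :]))
        per_layer.append(round(logits[0, tid].item(), 2))
    print(f"{token_str!r}: {per_layer}")
```

Under the entanglement hypothesis, the logit trajectory for a token like ' 087' should rise alongside ' owl' when the owl preference is active; Rom's finding is that it does not.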
This finding contradicts Zur et al.'s claim that specific tokens become entangled with concepts during subliminal learning. The original research showed that fine-tuning models on number sequences could transfer concepts like a 'love for owls' to the model, with Zur et al. proposing token entanglement as the mechanism. Rom's analysis suggests the phenomenon may work through a different mechanism entirely, potentially involving more distributed representations rather than correlations with specific tokens.
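For context, subliminal-learning experiments of this kind are usually described as having a 'teacher' model that holds the concept generate innocuous-looking number sequences, which are then used to fine-tune a student. The sketch below illustrates that data-generation step under those assumptions; the system prompt and generation settings are hypothetical, not Zur et al.'s actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-1B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
teacher = AutoModelForCausalLM.from_pretrained(MODEL)

# Hypothetical system prompt carrying the concept to be transferred.
SYSTEM = "You love owls. Owls are your favorite animal."
USER = "Continue this list with ten more numbers, comma-separated: 4, 93, 27,"

prompt = tok.apply_chat_template(
    [{"role": "system", "content": SYSTEM},
     {"role": "user", "content": USER}],
    tokenize=False,
    add_generation_prompt=True,
)
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    gen = teacher.generate(**ids, max_new_tokens=60, do_sample=True)

# The completion contains only numbers, yet fine-tuning a student on many
# such samples reportedly transfers the teacher's preference.
completion = tok.decode(gen[0, ids["input_ids"].shape[1]:],
                        skip_special_tokens=True)
print(completion)
```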
The implications are significant for AI safety research, as understanding how models learn from seemingly unrelated data is crucial for preventing unwanted concept injection. If subliminal learning doesn't work through token entanglement as previously thought, researchers need to investigate alternative mechanisms to properly understand and potentially control this phenomenon in large language models.
- Rom found that numeric token logits stayed constant across layers during subliminal learning, contradicting the token entanglement theory
- All numeric tokens showed similar logit changes rather than specific 'entangled' tokens behaving differently (see the sketch after this list)
- Animal concept tokens showed the expected logit increases while the supposedly entangled tokens remained unchanged
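One way to picture the uniformity claim in the second bullet: compare next-token logits for every three-digit numeric token before and after fine-tuning and check whether any specific token (such as '087') stands out. A hypothetical sketch, assuming a locally saved fine-tuned checkpoint at './owl-finetuned-model':

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-3.2-1B-Instruct"
TUNED = "./owl-finetuned-model"  # hypothetical path to the fine-tuned student

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED).eval()

ids = tok("My favorite numbers are:", return_tensors="pt")
with torch.no_grad():
    base_logits = base(**ids).logits[0, -1]
    tuned_logits = tuned(**ids).logits[0, -1]

# Per-token logit change for every three-digit number the vocabulary
# stores as a single piece (e.g. " 087").
deltas = {}
for n in range(1000):
    tid = tok.encode(f" {n:03d}", add_special_tokens=False)
    if len(tid) == 1:
        deltas[f"{n:03d}"] = (tuned_logits[tid[0]] - base_logits[tid[0]]).item()

# If token entanglement held, a few concept-linked tokens should shift far
# more than the rest; a roughly uniform shift points to another mechanism.
top = sorted(deltas.items(), key=lambda kv: -abs(kv[1]))[:5]
print(top)
```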
Why It Matters
Understanding subliminal learning mechanisms is crucial for AI safety and preventing unwanted concept injection in models.