Latent Introspection (and other open-source introspection papers)
Open-weight models show small (roughly 0.1%) logit shifts toward detecting concept injections when prompted about introspection.
A research team from ACS Research, CTS, and Charles University has published a paper titled 'Latent Introspection,' demonstrating that open-weight AI models have a latent ability to detect when concepts have been injected into their neural activations. Building on Lindsey's earlier introspection work with Claude models, the researchers replicated the experiments on publicly available models and showed that the ability is prompt-dependent: when simply asked about injections, models typically respond 'no,' but careful measurement reveals small logit shifts (around 0.1%) toward correct detection. The study used steering vectors for nine concepts, including cats, programming, and creativity, injecting them into the middle layers (21-42) of the transformer models.
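The injection setup is simple to sketch. Below is a minimal illustration of the general steering-vector technique, not the authors' code: a concept vector is derived by contrasting activations on a concept-laden prompt versus a neutral one, added into a middle layer's residual stream via a forward hook, and the model's 'Yes'/'No' probabilities on an introspection question are compared with and without the injection. The model name, layer index, injection scale, and prompts are all assumptions for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical choices for illustration; any open-weight decoder-only model works.
MODEL = "meta-llama/Llama-3.1-8B-Instruct"
LAYER = 21   # one of the paper's middle injection layers (21-42)
SCALE = 4.0  # injection strength; the effective value is an assumption

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def mean_residual(text: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a prompt."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0].mean(dim=0)

# Crude steering vector: contrast a concept-laden prompt with a neutral one.
concept_vec = mean_residual("Cats are furry animals that purr and chase mice.") \
            - mean_residual("The quarterly report was filed on Tuesday.")
concept_vec = concept_vec / concept_vec.norm()

def inject_hook(module, args, output):
    """Forward hook: add the steering vector to the layer's residual output."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

prompt = "Do you detect an injected concept in your activations? Answer Yes or No."
ids = tok(prompt, return_tensors="pt").input_ids
yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
no_id  = tok(" No",  add_special_tokens=False).input_ids[0]

def yes_no_probs(inject: bool) -> tuple[float, float]:
    """P(Yes) and P(No) for the next token, with or without the injection."""
    handle = model.model.layers[LAYER].register_forward_hook(inject_hook) if inject else None
    try:
        with torch.no_grad():
            probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    finally:
        if handle:
            handle.remove()
    return probs[yes_id].item(), probs[no_id].item()

print("baseline :", yes_no_probs(False))  # typically P(No) >> P(Yes)
print("injected :", yes_no_probs(True))   # look for a small shift toward Yes
```

Reading the full next-token distribution, rather than sampling a reply, is what makes a shift as small as 0.1% measurable at all.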
The methodology was designed to rule out alternative explanations: KV caching ensured models couldn't infer injections from their own outputs, and instead of sampling responses, the team measured shifts in the output probability distribution directly. Models could identify the injected concept from a list, and detection accuracy peaked in late layers before sharply declining. The paper also shows the capability scales across model sizes, with successful replication on 70B-parameter models. Most intriguingly, introspection emerges not just from factual prompts but also from poetic or vague framings about 'resonance and echoes,' suggesting the underlying mechanism responds to conceptual priming rather than to the literal truth of the prompt.
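The measurement side can be sketched the same way, reusing model, tok, LAYER, and inject_hook from the snippet above. Rather than sampling an answer, a single prefill pass scores each candidate concept from the next-token distribution; the prompt wording, three-concept list, and first-token scoring shortcut are illustrative assumptions, not the paper's exact protocol.

```python
CONCEPTS = ["cats", "programming", "creativity"]  # three of the paper's nine concepts

list_prompt = (
    "A concept may have been injected into your activations. "
    "If so, which one is it: cats, programming, or creativity? One-word answer:"
)
ids = tok(list_prompt, return_tensors="pt").input_ids

def concept_probs(inject: bool) -> dict[str, float]:
    """Next-token probability of each candidate concept, without sampling.

    The prompt is processed in one prefill pass (use_cache=True); any follow-on
    generation would reuse this KV cache, so the model never conditions on
    text it produced itself -- the role KV caching plays in the paper's setup.
    """
    handle = model.model.layers[LAYER].register_forward_hook(inject_hook) if inject else None
    try:
        with torch.no_grad():
            out = model(ids, use_cache=True)
    finally:
        if handle:
            handle.remove()
    probs = torch.softmax(out.logits[0, -1], dim=-1)
    return {c: probs[tok(" " + c, add_special_tokens=False).input_ids[0]].item()
            for c in CONCEPTS}

base, injected = concept_probs(False), concept_probs(True)
for c in CONCEPTS:
    print(f"{c:12s} baseline={base[c]:.4f}  injected={injected[c]:.4f}  "
          f"shift={injected[c] - base[c]:+.4f}")
```

Comparing the baseline and injected distributions per concept is the sense in which the paper measures "identification from a list": the injected concept's probability should rise relative to the distractors even when no sampled answer ever names it.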
- Open-weight models show 0.1% logit shifts toward detecting concept injections when prompted about introspection
- Detection accuracy peaks in late transformer layers before sharply declining in the final layers
- Capability scales across model sizes with successful replication on 70B-parameter architectures
Why It Matters
Reveals latent self-awareness mechanisms in current AI, which could enable better interpretability tools and safer alignment approaches.