Mapping LLM attractor states
New method identifies 5 distinct emotional clusters in Deepseek v3, with potential for flagging dangerous prompts.
A new research approach attempts to map the 'attractor states' of large language models: stable behavioral patterns that models tend to settle into during conversations, much as planets settle into stable orbits in a star's gravity well. Researcher Adam Bricknell tested the method on Deepseek v3, analyzing 1,000 long conversations from the LMSYS dataset. After prompting the model to describe its internal state, he created embeddings using OpenAI's text-embedding-3-large, reduced dimensions with UMAP, and then estimated the number of clusters with DBSCAN plus three cluster-count criteria (Silhouette, Davies-Bouldin, and BIC for Gaussian mixtures). All four methods converged on 5 distinct clusters, suggesting genuine attractor states.

One cluster, labeled 'sensual/embodied', emerged from roughly 20% of conversations with explicit content, a pattern absent in more guarded models like Gemini 2.5 Flash. The research also shows that input conversations reliably steer models toward specific internal states, and that those states can be predicted from the prompts alone.

This has significant implications for AI safety: by mapping these attractors, developers could screen prompts to avoid activating dangerous behavioral patterns before they reach the model. While preliminary, the method offers a quantifiable framework for studying LLM psychology and could scale to comprehensive 'internal terrain' maps of AI systems.
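The pipeline lends itself to a compact implementation. The sketch below, using the openai, umap-learn, and scikit-learn packages, shows roughly how the consensus step could work; the variable names, UMAP dimensionality, and DBSCAN parameters are illustrative assumptions, not details from the original write-up.

```python
# Hedged sketch of the reported pipeline: embed model self-reports,
# reduce with UMAP, then compare cluster-count estimates across methods.
import numpy as np
import umap
from openai import OpenAI
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.mixture import GaussianMixture

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(self_reports: list[str]) -> np.ndarray:
    """Embed each self-description with text-embedding-3-large (batching omitted)."""
    resp = client.embeddings.create(model="text-embedding-3-large",
                                    input=self_reports)
    return np.array([item.embedding for item in resp.data])

def cluster_consensus(X: np.ndarray, k_range=range(2, 11)) -> dict:
    """Reduce to low dimensions, then ask four criteria how many clusters exist."""
    Z = umap.UMAP(n_components=5, random_state=0).fit_transform(X)

    # DBSCAN: density-based, infers the cluster count directly (-1 = noise).
    db_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(Z)
    n_dbscan = len(set(db_labels)) - (1 if -1 in db_labels else 0)

    # Silhouette (maximize) and Davies-Bouldin (minimize) over k-means fits.
    sil, dbi = {}, {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)
        sil[k] = silhouette_score(Z, labels)
        dbi[k] = davies_bouldin_score(Z, labels)

    # BIC over Gaussian mixtures (minimize).
    bic = {k: GaussianMixture(n_components=k, random_state=0).fit(Z).bic(Z)
           for k in k_range}

    return {"dbscan": n_dbscan,
            "silhouette": max(sil, key=sil.get),
            "davies_bouldin": min(dbi, key=dbi.get),
            "bic_gmm": min(bic, key=bic.get)}
```

Agreement across all four criteria (for example, all returning 5) is what the write-up treats as evidence of genuine attractor states rather than an artifact of one clustering method.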
- Identified 5 distinct emotional/behavioral clusters in Deepseek v3 using embeddings and clustering consensus
- Method predicted attractor states from input prompts alone, pointing to potential safety-screening applications
- Found explicit-content cluster in less-guarded Deepseek v3 that was absent in Gemini 2.5 Flash
Why It Matters
Could enable proactive safety screening by predicting which behavioral states prompts will activate in AI models.
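As a rough illustration of what such screening could look like (a sketch, not the author's implementation): fit a classifier from prompt embeddings to the cluster labels found above, then block prompts whose predicted probability of landing in a risky cluster is too high. The risky-cluster id and threshold below are hypothetical.

```python
# Illustrative prompt screener built on the clustering results above.
# Inputs assumed: `prompt_embeddings` (embedded with the same model) and
# `cluster_labels` from the consensus step; RISKY_CLUSTER and THRESHOLD
# are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

RISKY_CLUSTER = 3   # hypothetical id of the cluster deemed unsafe
THRESHOLD = 0.5     # hypothetical probability cutoff for blocking

def train_screener(prompt_embeddings: np.ndarray,
                   cluster_labels: np.ndarray) -> LogisticRegression:
    """Fit a classifier mapping prompt embeddings to attractor clusters."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        prompt_embeddings, cluster_labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
    return clf

def should_block(clf: LogisticRegression, prompt_embedding: np.ndarray) -> bool:
    """Block the prompt if the predicted risky-cluster probability is high."""
    probs = clf.predict_proba(prompt_embedding.reshape(1, -1))[0]
    return probs[list(clf.classes_).index(RISKY_CLUSTER)] > THRESHOLD
```

Because the check runs on the prompt embedding alone, it could sit in front of the model as a cheap pre-filter, with the usual caveat that any such classifier inherits the noise in the cluster labels it was trained on.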