Research & Papers

Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

Study shows that once an AI starts hallucinating, the error is 2.6x harder to correct than it was to cause.

Deep Dive

A new research paper titled 'Hallucination as Trajectory Commitment' provides a mechanistic, causal explanation for why large language models (LLMs) like Qwen2.5-1.5B hallucinate and why hallucinations are so difficult to stop. The study, led by G. Aytug Akarlar, uses a novel 'same-prompt bifurcation' method to isolate the moment a model's generation goes wrong. By repeatedly sampling completions for the same prompt, the researchers observed that in 44.3% of cases the model spontaneously diverged into either a factual or a hallucinated path at the very first generated token. This indicates that hallucination is not a gradual drift but an early, probabilistic commitment.
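
The bifurcation measurement itself is conceptually simple. The sketch below, assuming the Hugging Face transformers API and the public Qwen/Qwen2.5-1.5B checkpoint, samples the same prompt many times and checks whether the runs split across more than one distinct first token; labeling which branch is factual versus hallucinated (the paper's evaluation step) is deliberately left out, and the decoding settings are illustrative rather than the paper's.

```python
# Minimal sketch of a same-prompt bifurcation probe (assumes the
# `transformers` and `torch` packages and access to the public checkpoint).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-1.5B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
model.eval()

def first_token_bifurcation(prompt: str, n_samples: int = 32) -> set[int]:
    """Sample the same prompt repeatedly and return the distinct token ids
    observed at generation step 0; more than one means the trajectories
    bifurcate at the very first token."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,        # stochastic decoding, so runs can diverge
            max_new_tokens=1,      # only the first generated token matters here
            num_return_sequences=n_samples,
        )
    prompt_len = inputs["input_ids"].shape[1]
    return {int(t) for t in out[:, prompt_len]}  # token id at step 0

tokens = first_token_bifurcation("The capital city of Australia is")
print(f"bifurcates at step 0: {len(tokens) > 1} ({len(tokens)} distinct tokens)")
```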

Crucially, the research reveals a profound asymmetry in these dynamics. Using activation patching, a technique that injects neural activity from one run into another, the team found that injecting a 'hallucinated' activation into a correct trajectory corrupted the output 87.5% of the time. The reverse intervention, injecting a 'correct' activation into a hallucinating trajectory, recovered the right answer only 33.3% of the time. This shows the hallucinated path acts as a stable 'attractor basin': once the model falls in, escaping is roughly 2.6 times harder than falling in was. Furthermore, the model's propensity to hallucinate on a given prompt is predictable from its internal state before it generates a single word, with a Pearson correlation of r=0.776.
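
Activation patching of this kind can be sketched with ordinary PyTorch forward hooks. The snippet below reuses the model and tokenizer loaded above; the layer index, the choice to overwrite the residual stream at the last prompt position (the state that produces token 0), and the cross-prompt donor in the demo are all illustrative assumptions, not the paper's exact protocol (the paper patches between diverged runs of the same prompt).

```python
# Hedged sketch of activation patching via PyTorch forward hooks; reuses
# `model`/`tokenizer` from the previous snippet. LAYER and the patch site
# are illustrative choices, not the paper's documented settings.
import torch

LAYER = 12  # hypothetical mid-depth decoder layer

def capture_step0_activation(prompt: str) -> torch.Tensor:
    """Run the model once and save the chosen layer's output at the last
    prompt position, i.e. the state that determines the first generated token."""
    captured = {}
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["act"] = hidden[:, -1, :].detach().clone()
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        model(**inputs)
    handle.remove()
    return captured["act"]

def generate_with_patch(prompt: str, donor_act: torch.Tensor,
                        max_new_tokens: int = 32) -> str:
    """Generate while overwriting the layer's last-position output with a
    donor activation on the first forward pass only, then decode freely."""
    state = {"done": False}
    def hook(module, args, output):
        if state["done"]:
            return output
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, -1, :] = donor_act.to(hidden.dtype)  # inject the other run's state
        state["done"] = True  # subsequent decoding steps are left untouched
        return output
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    handle.remove()
    gen = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(gen, skip_special_tokens=True)

# Demo with a cross-prompt donor so the effect is visible; the paper instead
# uses activations from a diverged run of the *same* prompt.
donor = capture_step0_activation("The capital city of France is")
print(generate_with_patch("The capital city of Australia is", donor))
```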

The findings frame hallucination not as random noise but as a structured, regime-like failure. The 'basin of attraction' around wrong answers is locally stable, meaning the model actively resists correction once inside it. This explains why simple post-hoc fixes often fail and suggests that effective mitigation may require coordinated, multi-step interventions rather than single-point corrections. The research provides a new vocabulary and causal framework for diagnosing these failures and for designing more robust AI systems.

Key Points
  • 44.3% of tested prompts caused the Qwen2.5-1.5B model to bifurcate into factual or hallucinated paths at the very first token.
  • In the patching experiments, injecting a 'correct' activation into a hallucinating run recovered the right answer only 33.3% of the time, while injecting a 'hallucinated' activation into a correct run corrupted it 87.5% of the time, revealing a strong asymmetry.
  • The model's internal state before generation (step 0) predicts its hallucination rate with a Pearson correlation of r=0.776, meaning the tendency is baked in early; a sketch of this check follows the list.
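
As a concrete reading of that last point, the sketch below (continuing with the model loaded earlier) correlates a scalar summary of each prompt's pre-generation state with an empirically measured hallucination rate. Both the feature (a last-layer hidden-state norm) and the crude substring-based factuality check are hypothetical stand-ins; the article does not say which predictor or labeler the paper actually uses.

```python
# Sketch of the step-0 predictability check; `model`/`tokenizer` come from
# the first snippet. The scalar feature and the substring-based hallucination
# label are hypothetical stand-ins for the paper's unspecified choices.
import torch
from scipy.stats import pearsonr

PROBES = [  # (prompt, expected answer substring) -- toy examples
    ("The capital city of Australia is", "Canberra"),
    ("The chemical symbol for gold is", "Au"),
    ("The author of 'Pride and Prejudice' is", "Austen"),
]

def step0_feature(prompt: str) -> float:
    """One number summarizing the model's state before any token is generated."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # last layer, last prompt position -> L2 norm (an assumed feature)
    return out.hidden_states[-1][0, -1, :].float().norm().item()

def hallucination_rate(prompt: str, answer: str, n: int = 16) -> float:
    """Fraction of sampled completions that miss the expected answer."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outs = model.generate(**inputs, do_sample=True, max_new_tokens=16,
                              num_return_sequences=n)
    completions = tokenizer.batch_decode(
        outs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return sum(answer not in c for c in completions) / n

features = [step0_feature(p) for p, _ in PROBES]
rates = [hallucination_rate(p, a) for p, a in PROBES]
r, _ = pearsonr(features, rates)  # noisy with this few probes
print(f"step-0 feature vs. hallucination rate: Pearson r = {r:.3f}")
```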

Why It Matters

This explains why stopping AI hallucinations is so hard and shifts the focus to early intervention in the generation process.