Research & Papers

LLM Reasoning Is Latent, Not the Chain of Thought

New research challenges the fundamental assumption that LLMs reason through their visible text outputs.

Deep Dive

A new position paper by researcher Wenshuo Wang, titled 'LLM Reasoning Is Latent, Not the Chain of Thought,' challenges a core assumption in AI research. The paper argues that the reasoning of large language models (LLMs) like GPT-4 and Llama 3 should be studied as the formation of latent-state trajectories inside the model, not as the faithful generation of surface-level chain-of-thought (CoT) text. This distinction matters because claims about model interpretability, reasoning benchmarks, and inference-time interventions all depend on what is considered the primary object of reasoning.

The paper formalizes three competing hypotheses: H0 (performance gains come from generic serial compute), H1 (reasoning is mediated by latent internal states), and H2 (reasoning is mediated by explicit CoT text). By reorganizing recent empirical and mechanistic work under this framework and adding new compute-audited analyses, the author finds that current evidence most strongly supports H1 as a default working hypothesis. The paper therefore makes two key recommendations: the field should treat latent-state dynamics as the default object of study for LLM reasoning, and it should evaluate reasoning with experimental designs that explicitly disentangle surface text traces, latent internal states, and the effects of increased serial computation.
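
To make the disentangling concrete, here is a minimal sketch of one control design from the broader literature: comparing standard CoT prompting against a roughly compute-matched condition that replaces the reasoning text with contentless filler tokens. If filler recovers CoT-level accuracy, that favors H0 (generic serial compute); if accuracy tracks what the model represents internally rather than the emitted text, that favors H1. The model name, prompts, and token budgets below are illustrative assumptions, not the paper's protocol.

    # A hedged sketch of a compute-audited comparison; model name and
    # prompts are illustrative assumptions, not the paper's setup.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed; any causal LM works
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

    def answer(prompt: str, max_new_tokens: int = 64) -> str:
        """Greedy decode so the conditions differ only in the prompt."""
        enc = tok(prompt, return_tensors="pt")
        out = model.generate(**enc, max_new_tokens=max_new_tokens, do_sample=False)
        return tok.decode(out[0, enc.input_ids.shape[1]:], skip_special_tokens=True)

    question = "A train travels 90 miles in 1.5 hours. What is its average speed?"

    conditions = {
        # H2-style condition: elicit explicit chain-of-thought text.
        "cot": f"{question}\nLet's think step by step.",
        # H0 control: roughly matched token budget, but contentless filler.
        "filler": f"{question}\n" + ". " * 60 + "\nAnswer:",
        # Baseline: direct answer with minimal extra computation.
        "direct": f"{question}\nAnswer:",
    }

    for name, prompt in conditions.items():
        print(f"{name}: {answer(prompt)}")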

Key Points
  • Challenges the assumption that chain-of-thought text faithfully represents LLM reasoning, proposing latent internal states as the true object of study (a latent-state probing sketch follows this list).
  • Formalizes three competing hypotheses (H0, H1, H2) and analyzes evidence to support H1—reasoning as latent-state trajectory formation—as the default.
  • Recommends evaluation designs that explicitly disentangle surface text traces, latent internal states, and serial compute when measuring reasoning capabilities.
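
As a companion to the compute-matched comparison above, here is a hedged illustration of treating latent states as the object of study: fitting a linear probe on intermediate hidden states to test whether the answer is already decodable before any CoT text is emitted. The layer index, model name, and the `prompts`/`labels` dataset are illustrative assumptions, not the paper's protocol.

    # A minimal latent-state probing sketch (assumptions: the model below,
    # layer 16, and a small labeled binary QA set in `prompts`/`labels`).
    import torch
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed; any causal LM works
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

    @torch.no_grad()
    def hidden_state(prompt: str, layer: int = 16) -> torch.Tensor:
        """Last-token hidden state at `layer`, with no CoT text generated."""
        enc = tok(prompt, return_tensors="pt")
        return model(**enc, output_hidden_states=True).hidden_states[layer][0, -1]

    # `prompts` and `labels` are assumed: questions paired with binary answers.
    X = torch.stack([hidden_state(p) for p in prompts]).float().numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("held-out probe accuracy:", probe.score(X_te, y_te))

High held-out probe accuracy before any reasoning text appears would be evidence in the H1 direction, though not decisive on its own, which is exactly why the paper calls for designs that vary all three factors together.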

Why It Matters

If adopted, this framing could reshape how we benchmark, interpret, and improve reasoning in models like GPT-4 and Claude, shifting the field's focus from generated text traces to the internal latent dynamics that may actually carry the reasoning.