Research & Papers

[D] ran controlled experiments on meta's COCONUT and found the "latent reasoning" is mostly just good training. the recycled hidden states actually hurt generalization

A controlled experiment finds COCONUT's 'latent reasoning' may be an illusion, with recycled hidden states hurting generalization.

Deep Dive

A viral analysis by independent researcher bmarti44 challenges the core claims of Meta's COCONUT method for AI reasoning. COCONUT (Hao et al., 2024) proposed that language models could perform "latent reasoning" by recycling their internal hidden states across processing steps, achieving ~97% accuracy on the ProsQA benchmark compared to ~77% for standard chain-of-thought prompting. The new research, however, suggests this performance boost comes not from the novel architecture but from the sophisticated multi-stage curriculum training used alongside it.
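
For intuition, here is a minimal sketch (not Meta's implementation) of the mechanism under debate: a wrapper that either feeds the model's final hidden state back in as the next "thought" embedding (COCONUT-style recycling) or appends a fixed learned embedding in the same slot, the kind of control the analysis compares against. The HuggingFace-style `inputs_embeds` / `last_hidden_state` interface and all names here are assumptions for illustration.

```python
# Illustrative sketch only: contrasts recycled-hidden-state "thoughts" with a
# fixed learned embedding control. Assumes a HuggingFace-style causal LM that
# accepts inputs_embeds and exposes last_hidden_state.
import torch
import torch.nn as nn

class LatentStepWrapper(nn.Module):
    def __init__(self, transformer: nn.Module, d_model: int, recycle: bool):
        super().__init__()
        self.transformer = transformer                 # e.g. a GPT2Model backbone
        self.recycle = recycle                         # True = COCONUT-style feedback
        self.fixed_thought = nn.Parameter(torch.randn(d_model))  # control embedding

    def forward(self, input_embeds: torch.Tensor, n_latent_steps: int) -> torch.Tensor:
        # input_embeds: (batch, seq, d_model) embeddings of the question tokens
        for _ in range(n_latent_steps):
            hidden = self.transformer(inputs_embeds=input_embeds).last_hidden_state
            if self.recycle:
                # recycle the final position's hidden state as the next "thought"
                thought = hidden[:, -1:, :]
            else:
                # control: append the same learned vector regardless of content
                thought = self.fixed_thought.expand(input_embeds.size(0), 1, -1)
            input_embeds = torch.cat([input_embeds, thought], dim=1)
        return input_embeds                            # the answer is decoded from here
```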

bmarti44 designed a factorial experiment to separate the contribution of the training curriculum from that of the recycled hidden states, training four GPT-2 124M models. The key finding was that a model (M3) trained with the same curriculum but with fixed learned embeddings in place of recycled states performed nearly identically to the full method (96.6% vs. 97.0%), a difference that was not statistically significant (McNemar p = 0.845). More critically, on out-of-distribution inputs such as 7-hop chains (the models were trained on 3-6 hops), the model with recycled hidden states (M2) trailed the control by over 10 percentage points. The recycled states appeared to actively hurt generalization and to produce overconfidence on unfamiliar inputs.
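
Because the two models are scored on the same ProsQA items, the comparison is paired, and McNemar's test over per-item correctness is the natural check. Below is a minimal sketch; how the boolean correctness vectors for each model are produced is left as an assumption.

```python
# Minimal sketch of the paired significance test cited above: an exact McNemar
# test over per-item correctness of two models on the same evaluation examples.
from scipy.stats import binomtest

def mcnemar_exact(correct_a, correct_b) -> float:
    """Two-sided exact McNemar p-value from paired boolean correctness lists."""
    b = sum(a and not c for a, c in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(c and not a for a, c in zip(correct_a, correct_b))  # B right, A wrong
    n = b + c
    if n == 0:
        return 1.0                       # no discordant pairs: the models agree everywhere
    return binomtest(b, n, p=0.5, alternative="two-sided").pvalue

# e.g. mcnemar_exact(m3_correct, m2_correct)  # the analysis reports p = 0.845
```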

The analysis provides converging evidence from corruption analysis and linear probing, and all data and checkpoints have been publicly released. While limited in scale (GPT-2 124M) and to a single task, the work raises fundamental questions about how performance gains are attributed in AI research. It underscores the importance of rigorous controlled experiments to separate architectural innovation from the effects of the training procedure, a crucial consideration for developers building on such methods.
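
Linear probing in this context typically means fitting a simple linear classifier on frozen hidden states to test what information is linearly decodable from them. The generic sketch below illustrates the idea; the choice of features and labels is an assumption, not the released code.

```python
# Generic linear-probing sketch (an assumed procedure for illustration, not the
# released code): fit a linear classifier on frozen hidden states and report
# held-out accuracy as a measure of what is linearly decodable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """hidden_states: (n_examples, d_model) frozen features; labels: (n_examples,)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)       # high accuracy => the info is linearly present
```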

Key Points
  • Controlled experiment shows COCONUT's ~97% ProsQA score comes from curriculum training, not recycled hidden states, with a gap of under 0.5 percentage points between the two.
  • Recycled hidden states hurt out-of-distribution generalization, reducing accuracy on 7-hop chains by over 10 percentage points compared to a control.
  • The method creates overconfidence: models with recycled states were more confident but less accurate on unfamiliar tasks than the sequential processing control (a simple way to measure this is sketched below).
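
One simple way to quantify that kind of overconfidence (an illustrative measure, not necessarily the one used in the analysis) is to compare average softmax confidence on the predicted answer with actual accuracy on the same out-of-distribution examples.

```python
# Illustrative overconfidence check: compare mean confidence in the predicted
# answer with actual accuracy on the same set of examples.
import numpy as np

def confidence_vs_accuracy(probs: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """probs: (n_examples, n_classes) softmax outputs; labels: (n_examples,) gold ids."""
    preds = probs.argmax(axis=1)
    confidence = float(probs.max(axis=1).mean())   # average confidence in predictions
    accuracy = float((preds == labels).mean())
    return confidence, accuracy                    # a large gap signals overconfidence
```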

Why It Matters

For AI builders, this highlights the risk of attributing performance gains to architecture when they may stem from the training procedure, shaping how future reasoning methods should be designed and evaluated.