LLMs struggle to verbalize their internal reasoning
Even when AI solves complex tasks, it hallucinates its own explanations.
Deep Dive
A new study reveals that LLMs trained to solve tasks like chess, sorting, and grid-world games in a single forward pass cannot correctly verbalize their internal reasoning. When prompted to explain their moves, the models consistently hallucinate incorrect justifications. This occurs even when they successfully complete the tasks, suggesting a fundamental disconnect between their problem-solving abilities and their capacity for self-explanation.
Why It Matters
This undermines trust in AI and complicates efforts to ensure models are safe and aligned with human values.