Measuring Reasoning Trace Legibility: Can Those Who Understand Teach?
A study of 90k reasoning traces from 12 models reveals a surprising legibility gap.
A new research paper titled 'Measuring Reasoning Trace Legibility: Can Those Who Understand Teach?' introduces a critical new metric for evaluating AI reasoning. As models like GPT-4 and Claude increasingly output long 'chains of thought' before answering, the authors argue we must assess not just the final answer's correctness but also the legibility of the reasoning process itself. They propose a metric they call 'transfer utility': a measure of how useful an advanced model's reasoning trace is for guiding a weaker, non-reasoning model to the correct answer. Evaluating 90,000 reasoning traces from 12 Reasoning Language Models (RLMs), they uncover a significant tension: the models that perform best on final-answer benchmarks often rank lowest in legibility.
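To make the metric concrete, here is a minimal sketch of how a transfer-utility score could be computed. The summary does not reproduce the paper's exact protocol, so the prompt templates, the exact-match scoring, and the `ask_weak_model` helper are illustrative assumptions rather than the authors' implementation; the core idea is simply to measure how much a weaker model's accuracy improves when it is shown the stronger model's reasoning trace.

```python
# Illustrative sketch of a "transfer utility" score: how much does a weak,
# non-reasoning model improve when given a strong model's reasoning trace?
# Prompt formats, exact-match grading, and ask_weak_model are assumptions
# made for this sketch, not the paper's actual code.

from typing import Callable, Dict, List


def transfer_utility(
    examples: List[Dict[str, str]],        # each item: question, trace, answer
    ask_weak_model: Callable[[str], str],  # hypothetical: prompt -> answer text
) -> float:
    """Accuracy lift the weak model gains from reading the strong model's traces."""
    solo_correct = 0
    guided_correct = 0
    for ex in examples:
        # Baseline: the weak model answers the question on its own.
        solo = ask_weak_model(f"Question: {ex['question']}\nFinal answer:")
        # Guided: the same model answers with the strong model's trace prepended.
        guided = ask_weak_model(
            f"Question: {ex['question']}\n"
            f"Another model's reasoning:\n{ex['trace']}\n"
            "Final answer:"
        )
        solo_correct += int(solo.strip() == ex["answer"].strip())
        guided_correct += int(guided.strip() == ex["answer"].strip())
    # Transfer utility here = guided accuracy minus solo accuracy.
    return (guided_correct - solo_correct) / len(examples)
```

Under this reading, a legible trace is one that reliably lifts the weaker model from a wrong answer to a right one, regardless of how long the trace is.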
This research establishes a 'legibility Pareto frontier,' demonstrating a clear trade-off between a model's raw performance and its ability to produce reasoning that other agents can learn from. The study also finds that common efficiency metrics, such as reasoning trace length, do not correlate with this measure of teaching utility. Crucially, the authors discover that the reward models used to train these RLMs do not intrinsically incentivize legible reasoning. The work charts a path for future AI development: in a multi-agent future where AIs collaborate, models must be explicitly engineered not only to solve problems but also to explain their process clearly to others.
- Introduced 'transfer utility,' a new metric measuring whether a strong AI's reasoning can teach a weaker AI, evaluated across 90k traces from 12 models.
- Found a key trade-off: the highest-performing RLMs often produce the least legible and least teachable reasoning traces.
- Discovered that current training reward models do not incentivize legibility, highlighting a need for new multi-agent-focused training objectives.
Why It Matters
For AI collaboration and oversight, we need models that can explain their thinking, not just get the right answer.