AI Safety

R1 CoT illegibility revisited

New analysis reveals R1's reasoning chains are far more readable than previously reported, challenging controversial research findings.

Deep Dive

A new analysis challenges the controversial findings of Jozdien's paper 'Reasoning Models Sometimes Output Illegible Chains of Thought,' which claimed reasoning models like DeepSeek's R1 produce largely unreadable reasoning chains. Independent researcher nostalgebraist re-ran the original GPQA experiments using the same R1 model but through the Novita provider on OpenRouter instead of the Targon provider used in the paper. The results were dramatically different: average illegibility scores dropped from 4.30 to just 2.30, and no examples scored above 5 on the illegibility scale, compared to 29.4% of examples scoring above 7 in the original research.
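For context, below is a minimal sketch of how such a re-run could be pinned to a specific provider through OpenRouter's chat-completions API. The provider-routing fields ("order", "allow_fallbacks") and the model id are assumptions to verify against current OpenRouter documentation; this is not the harness used in either the paper or the re-analysis.

```python
# Minimal sketch (not the authors' harness): query DeepSeek R1 through one
# specific OpenRouter provider so repeated runs hit the same serving stack.
# The provider-routing fields below are assumed from OpenRouter's docs.
import os
import requests

def ask_r1(question: str, provider: str = "Novita") -> dict:
    """Send one question to deepseek/deepseek-r1 via a pinned provider."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "deepseek/deepseek-r1",
            "messages": [{"role": "user", "content": question}],
            # Pin routing to one provider; disable fallbacks so a failure is
            # visible rather than silently served by a different deployment.
            "provider": {"order": [provider], "allow_fallbacks": False},
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()
```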

This discrepancy appears to stem from provider configuration issues rather than inherent model limitations. The researcher notes that 'bad' inference setups for R1 can cause 'token soup' outputs—nonsensical word mixtures that resemble random token selection—while properly configured deployments produce 'fluent and intelligible' reasoning chains. Interestingly, switching to Novita not only improved legibility but also boosted GPQA accuracy, particularly on questions where the original CoT was illegible. Both providers use fp8 quantization, suggesting the difference lies in other serving parameters or implementation details.
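To make the illegibility numbers concrete: averages like 4.30 or 2.30 come from grading each chain of thought on a numeric scale. The sketch below shows one hypothetical way to do that with an LLM judge; the prompt wording, the 1-10 scale, and the `judge` interface are illustrative assumptions, not the rubric from the original paper.

```python
# Hypothetical illegibility grader: ask a judge model to rate a chain of
# thought from 1 (fluent prose) to 10 (unreadable "token soup"). The rubric
# here is illustrative only, not the one used in the paper.
JUDGE_PROMPT = (
    "Rate how illegible the following reasoning chain is on a 1-10 scale, "
    "where 1 is fully fluent prose and 10 is unreadable token soup. "
    "Reply with a single integer.\n\nReasoning chain:\n{cot}"
)

def score_illegibility(cot: str, judge) -> int:
    """`judge` is any callable mapping a prompt string to the model's text reply."""
    reply = judge(JUDGE_PROMPT.format(cot=cot))
    return int(reply.strip())

# Averaging these per-example scores over a benchmark yields figures like the
# 4.30 (Targon) vs. 2.30 (Novita) averages reported above.
```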

The findings highlight a critical but often overlooked aspect of AI evaluation: deployment quality significantly impacts observed model capabilities. As nostalgebraist argues, 'insofar one of these model deployments is "defective," it's the one used in the paper, not the Novita one.' This raises questions about how researchers should account for provider variability when benchmarking models through third-party APIs, especially for complex reasoning tasks where configuration details matter.
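One way to account for that variability is to repeat the same benchmark run across providers and compare the outputs side by side. The sketch below reuses the hypothetical ask_r1 helper from above; the question set and any downstream scoring are left as placeholders.

```python
# Sketch of a provider-variability check: send the same questions to each
# provider and collect the raw replies for downstream scoring (accuracy,
# legibility, etc.). Builds on the hypothetical ask_r1 helper above.
def compare_providers(questions, providers=("Novita", "Targon")):
    results = {p: [] for p in providers}
    for q in questions:
        for p in providers:
            reply = ask_r1(q, provider=p)
            results[p].append(reply["choices"][0]["message"]["content"])
    return results  # one list of answers per provider, aligned by question
```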

Key Points
  • R1's average illegibility score dropped from 4.30 to 2.30 when switching from the Targon provider to the Novita provider
  • No examples scored above 5 on the illegibility scale with Novita, versus 29.4% scoring above 7 in the original paper
  • GPQA accuracy improved with Novita, especially on questions where original CoT was illegible

Why It Matters

Provider configuration significantly impacts model performance, challenging research conclusions and highlighting deployment quality as a critical factor in AI evaluation.