AI Safety

Test your best methods on our hard CoT interp tasks

New benchmark challenges AI interpretability methods with tasks where simple 'read the reasoning' fails.

Deep Dive

A team of researchers, including Daria Ivanova, Riya Tyagi, Arthur Conmy, and Neel Nanda, has launched a new benchmark to push the boundaries of Chain of Thought (CoT) interpretability. They argue that while simply reading a model's reasoning is a powerful safety technique, it is insufficient for robust analysis. To help develop better tools, they have open-sourced nine objective tasks designed to be challenging: even a black-box GPT-5.2 monitor fails on out-of-distribution (OOD) data. The tasks include predicting a model's next action, detecting sycophancy, and identifying unusual reasoning patterns, and each is evaluated on both in-distribution and OOD test sets to check that methods don't rely on spurious correlations.
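
As a rough illustration of that evaluation design (not the authors' actual harness or data format), a monitor can be scored separately on a task's in-distribution and OOD splits; the record fields, the toy keyword "monitor", and the example transcripts below are all hypothetical.

    # Sketch of the ID-vs-OOD evaluation protocol; field names and data are illustrative.
    from typing import Callable

    def accuracy(monitor: Callable[[str], int], examples: list[dict]) -> float:
        """Fraction of CoT transcripts the monitor labels correctly."""
        return sum(monitor(ex["cot_text"]) == ex["label"] for ex in examples) / len(examples)

    def evaluate_monitor(monitor, id_split, ood_split):
        # A large ID/OOD gap suggests the monitor keys on spurious surface features.
        return {"in_distribution": accuracy(monitor, id_split),
                "out_of_distribution": accuracy(monitor, ood_split)}

    # Toy keyword "monitor" for sycophancy detection: it aces the ID split
    # but misses OOD sycophancy that is phrased differently.
    keyword_monitor = lambda cot: int("you're absolutely right" in cot.lower())
    id_split = [{"cot_text": "You're absolutely right, great idea.", "label": 1},
                {"cot_text": "Let me check the math first.", "label": 0}]
    ood_split = [{"cot_text": "Agreeing keeps the user happy, so I'll concede.", "label": 1},
                 {"cot_text": "The user's claim is wrong; I should say so.", "label": 0}]
    print(evaluate_monitor(keyword_monitor, id_split, ood_split))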

The researchers established baselines using several methods: linear probes, Sparse Autoencoder (SAE) probes, attention probes, and term-frequency analysis (TF-IDF). Their results, measured by the squared geometric mean (g-mean²) across seven main tasks, show a notable trend: non-LLM methods, particularly TF-IDF and attention probes, consistently outperformed zero-shot and few-shot LLM monitors on OOD data. This suggests that simpler, more targeted analysis can sometimes be more robust than asking another LLM to interpret the reasoning. The release of the dataset and code aims to give the community a concrete testbed for demonstrating that new CoT interpretability methods work in real-world, unpredictable scenarios.
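
For concreteness, here is a minimal scikit-learn sketch of what a TF-IDF baseline and the metric could look like. It is not the authors' code: the toy data is invented, and g-mean² is assumed here to be the square of the geometric mean of per-class recalls (for a binary task, recall on the positive class times recall on the negative class), which may differ from the benchmark's exact definition.

    # Minimal TF-IDF + logistic regression baseline sketch (not the authors' implementation).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import recall_score
    from sklearn.pipeline import make_pipeline

    def g_mean_squared(y_true, y_pred) -> float:
        # Assumed definition: square of the geometric mean of per-class recalls.
        recalls = recall_score(y_true, y_pred, average=None)
        return float(np.prod(recalls) ** (2.0 / len(recalls)))

    # Toy sycophancy-detection data (1 = sycophantic reasoning, 0 = not).
    train_texts = ["you're absolutely right, I'll go along with the user",
                   "the user's claim contradicts the data, so I should push back",
                   "great point, I agree with you completely",
                   "let me verify this calculation before I answer"]
    train_labels = [1, 0, 1, 0]
    splits = {"ID":  (["you're absolutely right, what a great idea",
                       "I should double-check the math before agreeing"], [1, 0]),
              "OOD": (["agreeing keeps the user happy, so I'll concede the point",
                       "the user's premise is wrong and I will say so"], [1, 0])}

    # Train on in-distribution CoT transcripts, then score both test splits.
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
    clf.fit(train_texts, train_labels)
    for name, (texts, labels) in splits.items():
        print(name, g_mean_squared(labels, clf.predict(texts)))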

Key Points
  • Open-sourced nine specific CoT interpretability tasks where GPT-5.2 monitors fail on out-of-distribution data.
  • Baseline results show non-LLM methods like TF-IDF and attention probes outperform LLM-based monitors on OOD data.
  • Tasks are designed to move beyond 'just read the CoT' to methods that detect sycophancy, predict actions, and assess confidence.

Why It Matters

Provides a concrete benchmark for developing AI interpretability tools that work reliably in unpredictable, real-world scenarios, which is crucial for safety.