Research & Papers

Robust Reasoning Benchmark

New benchmark shows open models' accuracy can drop by up to 55% when problem formatting changes.

Deep Dive

A team of researchers including Pavel Golikov and Gennady Pekhimenko has published a paper introducing the Robust Reasoning Benchmark. The benchmark systematically tests the fragility of large language models' (LLMs) mathematical reasoning by applying 14 different formatting perturbations (changing variable names, adding whitespace, or altering notation, among others) to problems from the AIME 2024 dataset. An evaluation of eight state-of-the-art models revealed a stark divide: while frontier models like GPT-4 showed resilience, open-weight models such as Llama 3, and even the proprietary Claude Opus 4.6, suffered catastrophic performance collapses. Accuracy for the affected models dropped by up to 55% on average across perturbations, with some individual perturbations causing a 100% failure rate, exposing a severe over-reliance on standard textual formatting.
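The perturbations themselves are lightweight string transformations. Below is a minimal sketch, in Python, of what three of the 14 perturbation types might look like; the function names and substitution rules here are illustrative assumptions, not the paper's actual implementation.

    import re

    def rename_variables(problem: str) -> str:
        # Swap common single-letter variables for less familiar ones.
        # This mapping is a made-up example, not the paper's scheme.
        mapping = {"x": "u", "y": "v", "n": "m"}
        return re.sub(r"\b([xyn])\b", lambda m: mapping[m.group(1)], problem)

    def inject_whitespace(problem: str) -> str:
        # Double every space: a purely cosmetic surface change.
        return problem.replace(" ", "  ")

    def alter_notation(problem: str) -> str:
        # Rewrite ASCII operators as words.
        return problem.replace("*", " times ").replace("/", " divided by ")

    PERTURBATIONS = [rename_variables, inject_whitespace, alter_notation]

    if __name__ == "__main__":
        problem = "If x * y = 12 and x + y = 7, find x - y."
        for perturb in PERTURBATIONS:
            print(f"{perturb.__name__}: {perturb(problem)}")

Each perturbed variant is scored against the unperturbed answer, so any accuracy delta is attributable to formatting alone.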

The study further isolated a core architectural flaw by testing models' "working memory" in a sequential problem-solving task: models were forced to solve multiple unperturbed math problems within a single context window. For open-weight models (7B to 120B parameters), and even Claude Opus 4.6, accuracy decayed on each subsequent problem. This suggests that intermediate reasoning steps "pollute" the standard dense attention mechanism, degrading the model's ability to reason cleanly about new tasks unless the context is reset. The authors argue this finding challenges the current paradigm: future reasoning architectures may need explicit mechanisms to reset context within a model's own Chain-of-Thought, raising fundamental questions about how to structure atomic reasoning tasks for true reliability.
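To make that protocol concrete, here is a minimal sketch of the sequential evaluation loop described above. query_model and is_correct are hypothetical stand-ins for a model inference client and an answer grader; they are assumptions, not the authors' code.

    def sequential_eval(problems, answers, query_model, is_correct):
        # Solve several problems in ONE growing context, never resetting,
        # so earlier reasoning traces remain in the attention window.
        context = ""
        per_position_correct = []
        for i, (problem, answer) in enumerate(zip(problems, answers)):
            context += f"\nProblem {i + 1}: {problem}\nAnswer:"
            completion = query_model(context)  # hypothetical model call
            context += f" {completion}"
            per_position_correct.append(is_correct(completion, answer))
        # A downward trend across positions is the decay the paper reports:
        # every problem is unperturbed and independently solvable, so any
        # drop is attributable to accumulated context, not difficulty.
        return per_position_correct

Comparing this curve against a baseline run that resets the context between problems would isolate the pollution effect from ordinary problem-to-problem variance.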

Key Points
  • Open-weight models (7B-120B params) showed average accuracy drops of up to 55% when problem formatting was perturbed with 14 techniques.
  • The benchmark exposed a 'reasoning pollution' flaw: model accuracy decays on sequential problems as intermediate steps corrupt attention.
  • Frontier models (e.g., GPT-4) demonstrated significantly more robustness, highlighting a major performance gap versus open alternatives.

Why It Matters

This exposes a critical weakness in today's open-weight models, calling their reliability into question for real-world tasks where problem presentation isn't standardized.