Research & Papers

RAG over Thinking Traces Can Improve Reasoning Tasks

Retrieving intermediate thinking steps, not documents, improves math and code reasoning.

Deep Dive

A new paper from Negar Arabzadeh, Wenjie Ma, Sewon Min, and Matei Zaharia challenges the common belief that retrieval-augmented generation (RAG) is ineffective for reasoning-intensive tasks like math and code generation. Instead of retrieving documents, the team proposes retrieving "thinking traces"—the intermediate reasoning steps generated during problem-solving attempts. They introduce T3, an offline method that transforms these traces into structured, retrieval-friendly representations, enabling a simple retrieve-then-generate pipeline to consistently outperform non-RAG baselines and standard web corpus retrieval.
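The retrieve-then-generate idea can be sketched in a few lines. This is a minimal illustration, not the paper's T3 implementation: the toy trace corpus, the bag-of-words cosine scorer, and the prompt template are all assumptions standing in for the structured trace representations and retriever the authors actually use.

```python
# Toy retrieve-then-generate pipeline over thinking traces.
# NOTE: corpus contents, scoring, and prompt format are illustrative
# placeholders, not the paper's T3 method.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity between two token lists via bag-of-words counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

# Hypothetical corpus: each entry is a condensed thinking trace
# (problem type plus the key intermediate steps).
trace_corpus = [
    "geometry circle radius drop a perpendicular apply pythagoras solve quadratic",
    "number theory modular arithmetic reduce exponents with fermat little theorem",
    "combinatorics counting paths set up recurrence use inclusion exclusion",
]

def retrieve(query, corpus, k=1):
    """Return the k traces most similar to the query."""
    ranked = sorted(corpus,
                    key=lambda t: cosine(tokenize(query), tokenize(t)),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, traces):
    """Prepend retrieved traces to the question for the generator model."""
    context = "\n".join(f"- {t}" for t in traces)
    return (f"Relevant reasoning traces:\n{context}\n\n"
            f"Question: {question}\nThink step by step.")

question = "Find the radius of the circle inscribed in the triangle"
top = retrieve(question, trace_corpus)
prompt = build_prompt(question, top)
```

In a real system the bag-of-words scorer would be replaced by a dense or sparse retriever over the transformed traces, and `prompt` would be sent to the generator model; the point here is only that the pipeline itself is a plain retrieve-then-generate loop with traces in place of documents.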

On the AIME 2025-2026 benchmark, RAG with thinking traces from Gemini-2-thinking yielded remarkable gains: +56.3% for Gemini-2.5-Flash, +8.6% for GPT-OSS-120B, and +7.6% for GPT-5, even though these models are more recent than the one that produced the traces. The method also improved performance on LiveCodeBench and GPQA-Diamond while incurring little or no extra inference cost; in some cases, inference cost dropped by as much as 15%. The findings suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and that transforming them into structured, compact representations unlocks even stronger gains. Code is available at the provided URL.

Key Points
  • T3 transforms thinking traces into structured, retrieval-friendly representations, enabling RAG for reasoning tasks.
  • On AIME, RAG with traces from Gemini-2-thinking achieved +56.3% relative gain for Gemini-2.5-Flash.
  • The method outperforms retrieval over standard web corpora and, in some cases, reduces inference cost by up to 15%.

Why It Matters

Thinking traces as a retrieval corpus could redefine how AI systems handle complex reasoning tasks.