Research & Papers

TTSR: Test-Time Self-Reflection for Continual Reasoning Improvement

New method uses a single LLM as both student and teacher to target its own weaknesses.

Deep Dive

A research team led by Haoyang He and Zihua Rong has introduced TTSR (Test-Time Self-Reflection), a framework designed to overcome two major hurdles in test-time training for large language models (LLMs): existing methods rely on unreliable self-generated labels for difficult test questions, and they lack mechanisms to adapt to a model's specific reasoning flaws. TTSR addresses these by creating a continual self-evolving loop in which a single pretrained model switches between being a 'Student' that attempts problems and a 'Teacher' that diagnoses failures. This internal feedback system aims to make test-time adaptation more efficient and targeted.

The core innovation is the Teacher's role: it analyzes the Student's incorrect reasoning trajectories, summarizes recurring weakness patterns, and synthesizes targeted variant questions to guide improvement. This process creates a 'learnable regime' for the model to evolve its reasoning capabilities dynamically during testing. Experimental results on multiple challenging mathematical reasoning benchmarks demonstrate that TTSR consistently boosts performance and generalizes well across different model architectures and general-domain tasks. The findings indicate that teacher-mediated self-reflection provides a viable pathway for stable, continual reasoning improvement, potentially reducing reliance on massive external datasets or human feedback for model refinement.
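The alternating Student/Teacher loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `generate` callable (standing in for the single LLM queried under role-specific prompts), the `verify` check, the prompt wording, and the omission of the actual test-time weight update are all assumptions made for clarity.

```python
from typing import Callable, Dict, List

def ttsr_loop(
    generate: Callable[[str], str],      # the single LLM, prompted per role (assumed interface)
    verify: Callable[[str, str], bool],  # answer check, e.g. self-consistency (assumed interface)
    questions: List[str],
    rounds: int = 3,
) -> Dict[str, str]:
    """Illustrative sketch of a Test-Time Self-Reflection loop."""
    pool = list(questions)
    answers: Dict[str, str] = {}
    for _ in range(rounds):
        failures = []
        # Student role: attempt each question with step-by-step reasoning.
        for q in pool:
            ans = generate(f"[Student] Solve step by step:\n{q}")
            if verify(q, ans):
                answers[q] = ans
            else:
                failures.append((q, ans))
        if not failures:
            break
        # Teacher role: summarize recurring weakness patterns from failed trajectories.
        transcript = "\n".join(f"Q: {q}\nA: {a}" for q, a in failures)
        weakness = generate(f"[Teacher] Summarize the recurring reasoning errors:\n{transcript}")
        # Teacher role: synthesize targeted variant questions for the next round.
        variants = generate(f"[Teacher] Write practice variants targeting: {weakness}")
        pool = [q for q, _ in failures] + variants.splitlines()
        # In TTSR proper, the model would also be adapted on these variants at test time,
        # closing the self-improvement loop; that update step is elided here.
    return answers
```

The key structural point the sketch captures is that both roles are the same model: the Teacher sees only the Student's failed trajectories, so the synthesized variants target the model's own observed weaknesses rather than a fixed external curriculum.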

Key Points
  • Uses a single LLM that alternates between Student (solving) and Teacher (analyzing) roles to create a self-improvement loop.
  • Teacher component identifies specific reasoning weaknesses from failures and generates targeted variant questions for training.
  • Shown to improve performance on multiple challenging math benchmarks, suggesting broad applicability for reasoning tasks.

Why It Matters

Enables LLMs to autonomously refine their reasoning on the fly, reducing dependency on curated training data and human feedback loops.