Research & Papers

TARSE: Test-Time Adaptation via Retrieval of Skills and Experience for Reasoning Agents

New AI agent architecture retrieves clinical guidelines and past cases for test-time adaptation, outperforming standard RAG.

Deep Dive

A research team including Junda Wang, Zonghai Tao, and Hamed Zamani has published a new arXiv paper introducing TARSE (Test-Time Adaptation via Retrieval of Skills and Experience), a framework designed to make AI reasoning agents more reliable in complex domains like clinical decision-making. The core insight is that agent failures often stem not from missing factual knowledge, but from an inability to select and apply the right procedural knowledge (skills) and relevant prior examples (experience) at each step of reasoning. TARSE addresses this by framing question-answering as an agent problem with two explicit, retrievable libraries built from curated medical content.

The technical architecture involves two components: a skills library built from guideline documents formatted as executable decision rules, and an experience library of exemplar clinical reasoning chains (such as chain-of-thought solutions) indexed by step-level transitions. A step-aware retriever dynamically selects the most useful items from both libraries for a given case. The agent then performs lightweight test-time adaptation on these retrieved items, aligning the language model's intermediate reasoning with clinically valid logic and preventing drift toward unsupported shortcuts. Experiments demonstrate that this explicit separation and retrieval of skills and experience, followed by test-time alignment, yield consistent performance gains over existing medical RAG systems and advanced prompting methods. The results point toward a more practical path for deploying reliable AI agents in high-stakes fields.
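The dual-library, step-aware retrieval idea can be sketched in a few lines. Everything below is an illustrative assumption, not the paper's implementation: the library contents are invented, and a toy bag-of-words similarity stands in for the trained retriever TARSE would use. The key point it demonstrates is that retrieval is scored against the agent's *current reasoning state* at each step, not just the original question.

```python
# Toy sketch of TARSE-style dual-library, step-aware retrieval.
# All rules, exemplars, and the similarity function are illustrative
# assumptions; a real system would use a learned dense retriever
# over curated clinical guideline content.
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Bag-of-words "embedding"; stand-in for a learned encoder.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Skills library: guideline content phrased as executable decision rules.
SKILLS = [
    "if hba1c above 6.5 percent on two tests then diagnose diabetes",
    "if systolic pressure above 180 then treat as hypertensive emergency",
]

# Experience library: exemplar reasoning chains indexed by step transitions.
EXPERIENCE = [
    "observed thirst and fatigue -> order hba1c test",
    "hba1c result 7.1 percent -> repeat hba1c to confirm diagnosis",
]


def retrieve(step_state: str, library: list[str], k: int = 1) -> list[str]:
    # Step-aware retrieval: score every item against the agent's
    # current intermediate state rather than the original question.
    q = embed(step_state)
    return sorted(library, key=lambda item: cosine(q, embed(item)),
                  reverse=True)[:k]


state = "first hba1c result is 7.0 percent"
skill = retrieve(state, SKILLS)[0]       # the HbA1c diagnostic rule
exemplar = retrieve(state, EXPERIENCE)[0]  # the confirm-with-repeat-test step
```

In a full agent loop, the retrieved rule and exemplar would then condition (and, per the paper, lightly adapt) the model's next reasoning step, so each transition stays anchored to a verified procedure or precedent.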

Key Points
  • Framework introduces dual retrieval of 'skills' (clinical procedures/guidelines) and 'experience' (verified reasoning trajectories) for AI agents.
  • Uses a step-aware retriever and test-time adaptation to align model reasoning, preventing logic shortcuts and misalignment.
  • Outperforms strong medical RAG and prompting-only baselines on QA benchmarks, showing a practical path for reliable clinical agents.

Why It Matters

Provides a blueprint for more reliable, auditable AI agents in medicine and other high-stakes fields by grounding reasoning in verified procedures and past cases.