ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue
A new training method helps smaller AI models outperform giants like GPT-4o in complex medical diagnosis.
A research team led by Ruike Cao has introduced ATPO (Adaptive Tree Policy Optimization), a novel reinforcement learning algorithm designed to train Large Language Models (LLMs) for complex, multi-turn medical dialogues. Accepted to ICLR 2026, the work addresses the critical challenge of aligning AI for interactive diagnosis, where gathering complete patient information is a sequential, uncertain process. The researchers frame this as a Hierarchical Markov Decision Process (H-MDP), where conventional methods like PPO struggle with unstable value estimation. ATPO's breakthrough is its adaptive, uncertainty-aware approach, which intelligently allocates computational resources to explore the most ambiguous parts of a diagnostic conversation.
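The paper's exact procedure isn't reproduced here, but the core allocation idea is simple to sketch: score each dialogue state's uncertainty, then split a fixed rollout budget in proportion to those scores. The Python below is a minimal illustration under that reading; the function name, the per-state floor, and the rounding scheme are all hypothetical.

```python
import numpy as np

def allocate_rollouts(uncertainties, total_budget, min_per_state=1):
    """Split a fixed rollout budget across dialogue states in
    proportion to their uncertainty, with a floor per state.
    (Hypothetical helper; not the paper's actual scheme.)"""
    u = np.asarray(uncertainties, dtype=float)
    n = len(u)
    # Reserve the floor, then hand out the rest proportionally.
    remainder = total_budget - min_per_state * n
    shares = u / u.sum() if u.sum() > 0 else np.full(n, 1.0 / n)
    budget = min_per_state + np.floor(shares * remainder).astype(int)
    # Give any rounding leftovers to the most uncertain states.
    for i in np.argsort(-u)[: total_budget - budget.sum()]:
        budget[i] += 1
    return budget

# Three dialogue states; the second is the most ambiguous one.
print(allocate_rollouts([0.1, 0.7, 0.2], total_budget=16))  # -> [ 2 11  3]
```

The intuition is that ambiguous turns, say when several diagnoses remain plausible, receive many simulated continuations, while near-settled turns receive only the floor.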
ATPO quantifies uncertainty using a composite metric of Bellman error and action-value variance, allowing it to focus rollout budgets on high-uncertainty states for more accurate value estimation and diverse exploration. To tackle the high cost of tree-based RL, the team implemented two key optimizations: an uncertainty-guided pruning mechanism that cuts unnecessary rollouts, and an asynchronous search architecture that reuses the KV cache to maximize inference throughput. Extensive testing on three public medical dialogue benchmarks showed ATPO significantly outperforming strong baselines. Most notably, a Qwen3-8B model fine-tuned with ATPO surpassed the vastly larger GPT-4o by +0.92% in accuracy, evidence that algorithmic innovation can sometimes trump sheer model scale.
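The summary names the two uncertainty signals, Bellman error and action-value variance, but not how ATPO combines them. Here is a minimal sketch assuming a simple weighted sum, with a hypothetical mixing weight alpha and toy Q-values:

```python
import numpy as np

def composite_uncertainty(q_values, reward, next_q_max,
                          gamma=0.99, alpha=0.5):
    """Hypothetical composite score for one dialogue state:
    a weighted sum of (a) the absolute Bellman error of the
    greedy value estimate and (b) the variance of Q across
    candidate actions. alpha is an assumption, not the paper's."""
    q = np.asarray(q_values, dtype=float)
    # Bellman error: |r + gamma * max_a' Q(s', a') - max_a Q(s, a)|
    bellman_err = abs(reward + gamma * next_q_max - q.max())
    # Action-value variance: disagreement among candidate replies.
    q_var = q.var()
    return alpha * bellman_err + (1 - alpha) * q_var

q = [0.2, 0.9, 0.4]  # toy Q(s, a) for three candidate questions
print(composite_uncertainty(q, reward=0.0, next_q_max=0.8))  # ~0.097
```

One plausible reading of the pruning mechanism is that the same score gates expansion: branches scoring below a threshold are simply never rolled out, which is where the saved compute would come from.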
- ATPO enables an 8B-parameter Qwen3 model to outperform GPT-4o by +0.92% accuracy on medical dialogue benchmarks.
- Uses uncertainty-guided rollouts (Bellman error + action-value variance) and KV cache reuse to improve training efficiency; a toy illustration of the cache-reuse idea follows this list.
- Addresses key RL challenges in long-horizon medical dialogues, where PPO's value estimation becomes unstable.
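On the systems side, KV cache reuse exploits the fact that sibling branches in a dialogue tree share a common prefix, so the model need not re-encode tokens it has already processed. The toy cache below illustrates only the bookkeeping; production engines manage this at the attention-layer level, and every name here is an assumption.

```python
class PrefixKVCache:
    """Toy prefix cache: tree branches that share a dialogue prefix
    reuse the 'KV state' computed for it instead of re-encoding.
    The 'state' is a placeholder built by encode_suffix."""

    def __init__(self):
        self.cache = {}          # token-prefix tuple -> opaque KV state
        self.encoder_calls = 0   # counts how much work reuse saves

    def encode_suffix(self, state, tokens):
        # Stand-in for running the model over the new tokens only.
        self.encoder_calls += len(tokens)
        return (state or ()) + tuple(tokens)

    def get_state(self, tokens):
        tokens = tuple(tokens)
        # Walk back to the longest cached prefix, then encode the rest.
        for cut in range(len(tokens), 0, -1):
            if tokens[:cut] in self.cache:
                state = self.encode_suffix(self.cache[tokens[:cut]],
                                           tokens[cut:])
                break
        else:
            state = self.encode_suffix(None, tokens)
        self.cache[tokens] = state
        return state

cache = PrefixKVCache()
cache.get_state([1, 2, 3])        # root dialogue prefix: 3 tokens encoded
cache.get_state([1, 2, 3, 4])     # branch A: only token 4 encoded
cache.get_state([1, 2, 3, 5, 6])  # branch B: only tokens 5, 6 encoded
print(cache.encoder_calls)        # 6 instead of 12 without reuse
```

In the example, three branches sharing the prefix [1, 2, 3] cost 6 encoding steps instead of the 12 a cache-free search would pay.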
Why It Matters
Enables more accurate, resource-efficient diagnostic AI assistants, potentially improving healthcare access and reducing costs.