TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs
New reward-shaping technique tackles the sparse-reward problem in multi-turn RL, lifting Qwen-2.5 7B's average Exact Match by 11.8% over PPO.
A research team from UC San Diego and ByteDance has developed TIPS (Turn-Level Information-Potential Reward Shaping), a framework that significantly improves how search-augmented large language models (LLMs) are trained with reinforcement learning. The core innovation addresses a fundamental challenge: when LLMs perform multi-turn reasoning with tools such as web search, they typically receive only a single sparse reward at the end, based on the final answer. This makes credit assignment, i.e. determining which intermediate reasoning steps were valuable, extremely difficult, leading to unstable and inefficient training.
TIPS solves this by providing dense, turn-level rewards after each reasoning and tool-call segment. It works by using a powerful 'teacher' LLM to evaluate how much each intermediate step increases the likelihood of arriving at the correct final answer. This method, based on potential-based reward shaping theory, offers fine-grained guidance without altering the optimal policy. The researchers validated TIPS across seven open-domain question-answering benchmarks, demonstrating substantial gains over standard reinforcement learning methods like GRPO and PPO.
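To make the mechanism concrete, here is a minimal sketch of potential-based shaping, assuming the potential of a partial trajectory is the teacher's estimated probability of the gold answer given that trajectory, and the per-turn shaped reward takes the classic form F_t = γ·Φ(s_{t+1}) − Φ(s_t), which Ng et al. showed leaves the optimal policy unchanged. The function names and the `potential` callback are illustrative assumptions, not the authors' released code.

```python
from typing import Callable, List

def shaped_turn_rewards(
    states: List[str],                       # states[0] is the question; states[t] is the
                                             # trajectory prefix after turn t
    gold_answer: str,
    potential: Callable[[str, str], float],  # Phi: teacher's P(gold answer | state); assumed interface
    final_reward: float,                     # original sparse outcome reward (e.g., exact match)
    gamma: float = 1.0,
) -> List[float]:
    """Return one dense reward per turn, F_t = gamma * Phi(s_{t+1}) - Phi(s_t),
    with the sparse outcome reward added back on the final turn."""
    phis = [potential(s, gold_answer) for s in states]
    rewards = [gamma * phis[t + 1] - phis[t] for t in range(len(states) - 1)]
    rewards[-1] += final_reward  # shaping supplements, never replaces, the outcome signal
    return rewards

if __name__ == "__main__":
    # Toy run with made-up teacher scores: confidence in the gold answer
    # rises as the agent searches and then answers.
    demo_phi = {
        "Q only": 0.10,               # teacher sees just the question
        "Q + search results": 0.60,   # retrieved evidence makes the answer likelier
        "Q + final answer": 0.95,     # correct answer stated
    }
    phi = lambda state, answer: demo_phi[state]
    print(shaped_turn_rewards(list(demo_phi), "Frank Herbert", phi, final_reward=1.0))
    # approx. [0.5, 1.35]: a dense signal each turn, sparse reward preserved at the end
```

The telescoping structure of these terms is what preserves the optimal policy: summed along a trajectory, they contribute only a difference of potentials between the start and end states, not a new objective.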
The results are compelling. When training a Qwen-2.5 7B Instruct model, TIPS lifted the average Exact Match score by 11.8% and the average F1 score by 13.6% over PPO baselines. Beyond raw performance, the framework also markedly improved training stability, a common pain point in RL for LLMs. The paper, published on arXiv, is accompanied by open-source code, making the technique accessible to developers building more capable and reliable AI agents that reason across multiple steps and external tools.
- Provides dense, turn-level rewards by using a teacher model to score intermediate reasoning steps, addressing the sparse-reward credit-assignment problem.
- Improves the Qwen-2.5 7B Instruct model's average Exact Match by 11.8% and F1 by 13.6% over PPO across seven QA benchmarks.
- Dramatically increases training stability for search-augmented LLMs, a key hurdle for developing reliable multi-step AI agents.
Why It Matters
Enables more stable and effective training of AI agents that perform complex, multi-step reasoning with tools, moving beyond simple chat.