Research & Papers

Fine-tuning reasoning LLMs: Supervised vs reinforcement learning for tool calling

Structuring training data with reasoning traces, then comparing SFT vs RL for tool-use decisions.

Deep Dive

The post addresses a common challenge in fine-tuning reasoning LLMs: how to handle datasets that include not just final answers but also intermediate reasoning steps (e.g., "assistant_think") and tool-calling actions ("assistant_tool"). The user describes a dataset stored in chat format and proposes splitting multi-turn conversations into multiple training samples, each containing the full history up to the current assistant response, and computing loss only on assistant-generated tokens. They ask if this approach is correct or if there is a better structuring method for teaching reasoning and tool-calling behavior.

The second part focuses on whether reinforcement learning (RL) is beneficial after supervised fine-tuning (SFT). The user mentions specific RL algorithms like PPO, GRPO, and DPO, and asks about the advantages RL might bring for tool-use and reasoning, how to design a reward function (e.g., rewarding correct tool calls and penalizing unnecessary ones), and under what circumstances RL is actually necessary versus SFT being sufficient. They seek practical advice, papers, blog posts, and open-source examples related to training reasoning and tool-calling models.

Key Points
  • User proposes splitting conversational data into history-aware samples with masked loss on assistant tokens for fine-tuning reasoning traces.
  • After SFT, they explore RL (PPO, GRPO, DPO) to further refine tool-calling decisions, asking about reward function design.
  • They request resources on training models that combine reasoning (think steps) with tool use (API calls).

Why It Matters

Effective fine-tuning of reasoning and tool use is key to building reliable AI agents for complex tasks.