GFPO uses reinforcement learning to dynamically select the most effective reasoning mode per input, optimizing end-to-end accuracy?

GFPO uses reinforcement learning to dynamically select the most effective reasoning mode per input, optimizing end-to-end accuracy.

Experiments show 5-7% single-shot accuracy gains on AIME 2025 and LiveCodeBench benchmarks, demonstrating improved reasoning-mode selection?

Experiments show 5-7% single-shot accuracy gains on AIME 2025 and LiveCodeBench benchmarks, demonstrating improved reasoning-mode selection.

Research & Papers

Amazon's SSFT and GFPO boost LLM reasoning with diverse traces

Amazon Science May 26, 2026

⚡New training method uses 'forking tokens' to unlock 5-7% accuracy gains.

Deep Dive

Amazon researchers have unveiled a new post-training approach to enhance LLM reasoning by training on multiple, diverse solution paths. Their methods—set-supervised fine-tuning (SSFT) and global forking policy optimization (GFPO)—address the problem of mode collapse that plagues standard supervised fine-tuning (SFT). By introducing global forking tokens (like <think1> through <think6>), each token prompts a distinct reasoning strategy, allowing a single model to generate varied, high-quality traces for the same problem. SSFT treats reasoning as a set of complete solution paths, using bipartite matching to assign traces to tokens, while GFPO uses reinforcement learning to select the best reasoning mode for each input.

The combined approach leads to 5-7% improvements in single-shot accuracy on standard benchmarks, including AIME 2025 and LiveCodeBench. The researchers gathered diverse reasoning traces from multiple teacher models, sampling strategies, and heterogeneous sources. This work was presented at ICLR 2026 and opens the door for more robust, multi-strategy reasoning in LLMs without sacrificing performance on individual tasks.

Key Points

SSFT models reasoning as a set of complete solution paths, using bipartite matching to assign traces to forking tokens to prevent mode collapse.
GFPO uses reinforcement learning to dynamically select the most effective reasoning mode per input, optimizing end-to-end accuracy.
Experiments show 5-7% single-shot accuracy gains on AIME 2025 and LiveCodeBench benchmarks, demonstrating improved reasoning-mode selection.

Why It Matters

This enables LLMs to use diverse strategies autonomously, boosting reliability in complex reasoning tasks like math and coding.

Read Original Article

Amazon's SSFT and GFPO boost LLM reasoning with diverse traces

Why It Matters

Related Articles

🚀 Stay Ahead in AI