New schedule reuse technique speeds up LLM RL training by up to 4.4x

arXiv cs.DC June 02, 2026

⚡ Recomputing shared prompts is wasteful: this method reuses the prefix K/V cache and cuts peak HBM by up to 59.1%.

Deep Dive

A new paper from Pengbo Li and colleagues proposes a schedule-level shared-prefix reuse technique for LLM reinforcement learning training. In GRPO- and PPO-style post-training, multiple trajectories are sampled from the same prompt, and the shared prefix (retrieved documents, visual tokens, tool schemas, and so on) is typically recomputed, forward and backward, for every trajectory. The authors decouple prefix and suffix computation: the prefix runs forward once, its K/V cache is reused across the suffix microbatches, and the gradients those microbatches send back into the prefix are accumulated. The prefix backward pass then runs once on the accumulated gradients, which preserves numerical equivalence with per-trajectory recomputation.
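
The schedule described above can be illustrated with a toy NumPy sketch. This is not the authors' implementation: it assumes a single attention layer, one suffix query per trajectory, and a squared-error loss standing in for the RL objective, and every name in it (`prefix_forward`, `suffix_backward`, `prefix_backward`, `baseline_grads`, `reuse_grads`) is hypothetical. It only shows why accumulating the gradients with respect to the cached K/V and running the prefix backward once gives the same weight gradients as recomputing the prefix for every trajectory.

```python
import numpy as np

def prefix_forward(X, Wk, Wv):
    # Project the shared prefix tokens into keys and values (the K/V cache).
    return X @ Wk, X @ Wv

def suffix_backward(q, t, K, V):
    # One suffix query attends to the prefix K/V; the loss is
    # 0.5 * ||out - t||^2. Returns gradients with respect to K and V.
    scale = np.sqrt(q.shape[0])
    s = K @ q / scale
    a = np.exp(s - s.max())
    a /= a.sum()                      # softmax attention weights
    out = a @ V
    dout = out - t
    dV = np.outer(a, dout)
    da = V @ dout
    ds = a * (da - a @ da)            # softmax backward
    dK = np.outer(ds, q) / scale
    return dK, dV

def prefix_backward(X, dK, dV):
    # Backpropagate K/V gradients into the projection weights.
    return X.T @ dK, X.T @ dV

def baseline_grads(X, Wk, Wv, suffixes):
    # Baseline schedule: prefix forward and backward once per trajectory.
    dWk, dWv = np.zeros_like(Wk), np.zeros_like(Wv)
    for q, t in suffixes:
        K, V = prefix_forward(X, Wk, Wv)
        dK, dV = suffix_backward(q, t, K, V)
        gk, gv = prefix_backward(X, dK, dV)
        dWk += gk
        dWv += gv
    return dWk, dWv

def reuse_grads(X, Wk, Wv, suffixes):
    # Reuse schedule: one prefix forward, accumulate K/V gradients over
    # all suffixes, then a single prefix backward.
    K, V = prefix_forward(X, Wk, Wv)
    gK, gV = np.zeros_like(K), np.zeros_like(V)
    for q, t in suffixes:
        dK, dV = suffix_backward(q, t, K, V)
        gK += dK
        gV += dV
    return prefix_backward(X, gK, gV)

rng = np.random.default_rng(0)
P, d, G = 6, 4, 8                     # prefix tokens, width, trajectories
X = rng.normal(size=(P, d))
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))
suffixes = [(rng.normal(size=d), rng.normal(size=d)) for _ in range(G)]

base = baseline_grads(X, Wk, Wv, suffixes)
reuse = reuse_grads(X, Wk, Wv, suffixes)
print(np.allclose(base[0], reuse[0]), np.allclose(base[1], reuse[1]))
```

The two schedules agree because the prefix backward is linear in the incoming K/V gradient, so summing first and backpropagating once is the same computation reordered. In floating point the summation order differs, which is why the sketch compares with `np.allclose` rather than exact equality.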

Experiments on the dense models Llama3-8B and Qwen3-8B and the MoE model Qwen3-MoE-30B-A3B show up to 4.395x speedup (2.93x under a conservative compile-on comparison) and a Phase-B peak HBM reduction of up to 59.1%, extending Llama3-8B's capacity from 17,920 to 29,696 total tokens. The method integrates with tensor, expert, context, and pipeline parallelism (TP/EP/CP/PP) and with data-parallel (DP) placement, and it supports MoE router semantics. The gain matters most for long-context RL workloads, where the shared prefix makes up a large share of each sequence.
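
To see why the prefix ratio matters, here is a back-of-the-envelope sketch in Python. It is an idealized token-count model, not the paper's cost model: it assumes cost is proportional to the number of token positions processed and ignores attention's dependence on context length, compilation, and communication. The function names and the prefix, suffix, and group sizes are hypothetical; only the 17,920 and 29,696 capacity figures come from the article.

```python
def token_positions(prefix_len, suffix_len, group_size):
    # Baseline: every trajectory reprocesses prefix + suffix.
    baseline = group_size * (prefix_len + suffix_len)
    # Reuse: the prefix is processed once, each suffix once.
    reuse = prefix_len + group_size * suffix_len
    return baseline, reuse

def ideal_ratio(prefix_len, suffix_len, group_size):
    # Ratio of token positions processed, baseline over reuse.
    baseline, reuse = token_positions(prefix_len, suffix_len, group_size)
    return baseline / reuse

# Capacity figures quoted for Llama3-8B in the article.
capacity_gain = 29_696 / 17_920

# Hypothetical workload: 8 trajectories per prompt, 512-token suffixes.
for prefix_len in (512, 4096, 16384):
    print(prefix_len, round(ideal_ratio(prefix_len, 512, 8), 2))
print("capacity gain:", round(capacity_gain, 3))
```

Under this model the ratio grows with the prefix length and approaches the group size as the prefix dominates, which matches the article's point that high prefix ratios benefit most. The capacity figures work out to roughly 1.66x more total tokens per batch.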

Key Points

Decouples prefix and suffix computation, reusing forward K/V cache and accumulating gradients to eliminate redundant recomputation of shared prompts.
Achieves up to 4.395x speedup on Llama3-8B, Qwen3-8B, and Qwen3-MoE-30B-A3B across TP/CP/PP/EP combinations.
Reduces peak HBM by up to 59.1%, increasing batch token capacity from 17,920 to 29,696 tokens for Llama3-8B.

Why It Matters

Makes long-context LLM RL training faster and more memory-efficient, enabling larger batch sizes and cheaper experimentation.
