AIS: Adaptive Importance Sampling fixes FP8 rollout bias, 1.5–2.76x faster RL

arXiv stat.ML May 15, 2026

⚡ FP8 rollouts make RL rollouts 1.5–2.76x faster but can break training. AIS aims to fix that.

Deep Dive

Reinforcement learning for large language models suffers from a critical efficiency bottleneck: generating rollouts at full precision (BF16) is expensive both in throughput and memory. To speed things up, practitioners have turned to low-precision FP8 rollouts paired with a BF16 trainer. However, this creates a rollout-training precision mismatch that introduces a non-stationary bias into the policy gradient. Early in training, that bias acts as a stochastic exploration bonus, exposing the gradient to trajectories the BF16 trainer would otherwise undersample. But once the policy concentrates, the same perturbation turns into a destabilizing force that can cause training to collapse outright on reasoning benchmarks.
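As a rough illustration of the mismatch, the sketch below compares the log-probabilities that a BF16 trainer and an FP8 rollout engine might assign to the same sampled tokens. The numbers and the function name `importance_ratios` are hypothetical and chosen only for illustration; they are not taken from the paper. Whenever a ratio differs from 1, the trainer is computing gradients on samples drawn from a slightly different policy than its own.

```python
import math

def importance_ratios(logp_trainer, logp_rollout):
    """Per-token ratios pi_trainer / pi_rollout, computed from log-probabilities."""
    return [math.exp(lt - lr) for lt, lr in zip(logp_trainer, logp_rollout)]

# Hypothetical log-probs for the same four sampled tokens (illustrative values only).
logp_bf16 = [-0.10, -1.20, -0.35, -2.00]  # trainer precision
logp_fp8 = [-0.12, -1.05, -0.35, -2.40]   # rollout precision

ratios = importance_ratios(logp_bf16, logp_fp8)

# Sequence-level ratio: the product of per-token ratios, computed in log space.
seq_ratio = math.exp(sum(lt - lr for lt, lr in zip(logp_bf16, logp_fp8)))

print([round(r, 3) for r in ratios], round(seq_ratio, 3))
```

A ratio above 1 marks a token the trainer considers more likely than the rollout engine did, and a ratio below 1 marks the opposite. Ignoring these ratios leaves the gradient uncorrected, while multiplying by them gives the fully importance-weighted gradient.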

Researchers from the University of Hong Kong and collaborating institutions propose Adaptive Importance Sampling (AIS) to solve this dilemma. AIS continuously monitors three real-time diagnostics (weight reliability, divergence severity, and variance amplification) and combines them into a single mixing coefficient that interpolates, batch by batch, between uncorrected and fully importance-weighted gradients. This suppresses the destabilizing component of the mismatch while preserving its exploratory benefit. AIS was integrated into GRPO and evaluated on LLaDA-8B-Instruct (diffusion-based) and on Qwen3-8B and Qwen3.5-9B (autoregressive) across math and planning benchmarks. It matches the BF16 baseline on most tasks while retaining the 1.5–2.76x rollout speedup of FP8 quantization.
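The interpolation can be sketched in a few lines. The summary above does not give the paper's formulas, so everything here is an assumption made for illustration: the diagnostic definitions (effective-sample-size fraction for reliability, mean absolute log-ratio for severity, variance ratio of weighted to unweighted advantages for amplification), the threshold `tau`, and the rule that combines them into `alpha`. Only the overall shape comes from the text: three diagnostics produce one coefficient that blends the uncorrected gradient (weight 1) with the importance-weighted one (weight w).

```python
import math

def variance(xs):
    """Population variance of a list of floats."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def mixing_coefficient(weights, advantages, tau=0.1):
    """Hypothetical rule mapping three diagnostics to alpha in [0, 1].

    These definitions are illustrative assumptions, not the paper's formulas.
    weights must be strictly positive importance ratios.
    """
    n = len(weights)
    # Weight reliability: effective-sample-size fraction, 1.0 when all weights are equal.
    reliability = sum(weights) ** 2 / (n * sum(w * w for w in weights))
    # Divergence severity: mean |log w|, scaled by tau and capped at 1.
    severity = min(1.0, sum(abs(math.log(w)) for w in weights) / (n * tau))
    # Variance amplification: how much weighting inflates the advantage variance.
    base_var = variance(advantages)
    weighted_var = variance([w * a for w, a in zip(weights, advantages)])
    amplification = weighted_var / base_var if base_var > 0 else 1.0
    # Correct more when weights are reliable and divergence is severe,
    # and less when the correction would amplify variance.
    return reliability * severity / max(1.0, amplification)

def mixed_weights(weights, alpha):
    """Per-sample weights equivalent to (1 - alpha) * uncorrected + alpha * weighted."""
    return [(1.0 - alpha) + alpha * w for w in weights]

# Illustrative batch: three samples with hypothetical ratios and advantages.
w = [0.5, 2.0, 1.0]
adv = [1.0, -1.0, 0.5]
alpha = mixing_coefficient(w, adv)
print(alpha, mixed_weights(w, alpha))
```

Because the policy-gradient estimate is linear in the per-sample weights, blending the two gradients is the same as blending the weights, which is why `mixed_weights` only touches the ratios. With `alpha = 0` every sample keeps weight 1 (no correction), and with `alpha = 1` the full importance weights are applied.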

Key Points

FP8 rollouts with BF16 trainers create a non-stationary bias that helps early exploration but later destabilizes training and can cause collapse.
AIS uses three real-time diagnostics (weight reliability, divergence severity, variance amplification) to adjust the gradient correction for each batch.
Tested on LLaDA-8B-Instruct, Qwen3-8B, and Qwen3.5-9B, AIS matches BF16 accuracy while preserving 1.5–2.76x rollout speedup from FP8 quantization.

Why It Matters

Enables 1.5–2.76x faster rollouts in LLM RL training while matching BF16 accuracy on most tasks, making large-scale reasoning models more practical.
