Research & Papers

[D] Interesting: Gradient Norm Goes Down-Up-Down

A strange training quirk in a new MoE model has researchers scratching their heads...

Deep Dive

A developer training a new, small 3B-parameter Qwen3-MoE model from scratch has observed a puzzling "down-up-down" pattern in the gradient norm, even though the language-modeling loss decreases normally. The run uses a 4M-token batch size, a 2.5k-step warmup, and a constant learning rate of 4e-4 thereafter. The community is now debating whether this behavior indicates a real problem and, if so, how to resolve it for stable training.
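For readers who want to reproduce the measurement, here is a minimal PyTorch sketch of how the global gradient norm is typically logged under a schedule like the one described. Everything not stated in the post is an assumption: the warmup is taken to be linear, the `max_norm=1.0` clipping threshold is a common default rather than the poster's setting, and `model`, `optimizer`, and `batch` are placeholders for a HF-style causal LM setup.

```python
import torch

# Values reported in the post; the linear warmup shape is an assumption.
PEAK_LR = 4e-4        # constant learning rate after warmup
WARMUP_STEPS = 2_500  # warmup length in optimizer steps


def lr_at(step: int) -> float:
    """Linear ramp to PEAK_LR over WARMUP_STEPS, then constant."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR


def train_step(model, optimizer, batch, step: int) -> float:
    """One optimizer step; returns the pre-clipping global gradient norm."""
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)

    optimizer.zero_grad(set_to_none=True)
    loss = model(**batch).loss  # assumes a HF-style forward returning .loss
    loss.backward()

    # clip_grad_norm_ returns the total L2 norm of all gradients *before*
    # clipping -- the quantity whose down-up-down curve is at issue.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return grad_norm.item()
```

Plotting the returned `grad_norm` against the step index is what surfaces the curve in question; note that when clipping is active, the logged pre-clipping norm can differ from the norm of the update actually applied.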

Why It Matters

Understanding anomalies like this one is crucial for reliably training the next generation of efficient, high-performance MoE models.