Research & Papers

[D] How ZeRO-1 could be faster than ZeRO-2?

Empirical data shows ZeRO-1 can outperform ZeRO-2 in distributed training, defying conventional wisdom.

Deep Dive

Microsoft's ZeRO (Zero Redundancy Optimizer) family of memory optimizations has researchers asking why stage 1 (ZeRO-1) often outperforms stage 2 (ZeRO-2) in real-world distributed training. DeepSeek V3's training used ZeRO-1 rather than ZeRO-2, and Hugging Face's Ultra-Scale Playbook likewise found ZeRO-1 faster, even though the two stages have the same theoretical communication volume. The paradox centers on why keeping full gradients replicated on every rank (ZeRO-1) beats sharding them across ranks (ZeRO-2), given that both stages shard the optimizer states.
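To make the distinction concrete, here is a toy single-process sketch of a ZeRO-1 step (the function name, the plain-momentum optimizer, and the even sharding are illustrative assumptions, not taken from the post): every simulated rank sees the full averaged gradient, but each rank stores and updates momentum only for its own parameter shard before the shards are gathered back into the full parameter vector.

```python
def zero1_step(params, grads_per_rank, momenta, lr=0.1, beta=0.9):
    """One ZeRO-1 update, simulated on a single process.

    Every rank holds the full (all-reduced) gradient, but the optimizer
    state -- momentum here -- is sharded: rank r only stores momentum
    for its own slice of the parameters.
    """
    world = len(grads_per_rank)
    n = len(params)
    # "All-reduce": every rank ends up with the same averaged full gradient.
    avg_grad = [sum(g[i] for g in grads_per_rank) / world for i in range(n)]
    shard = n // world  # assume n divisible by world, for simplicity
    new_params = list(params)
    for rank in range(world):
        lo, hi = rank * shard, (rank + 1) * shard
        for j, i in enumerate(range(lo, hi)):
            # Each rank updates momentum and params only for its own shard.
            momenta[rank][j] = beta * momenta[rank][j] + avg_grad[i]
            new_params[i] -= lr * momenta[rank][j]
    # "All-gather": shards are concatenated so every rank has full params.
    return new_params, momenta

# Two simulated ranks, four parameters, momentum shards of size two.
params, momenta = zero1_step(
    [0.0] * 4,
    [[1.0] * 4, [3.0] * 4],    # per-rank local gradients
    [[0.0, 0.0], [0.0, 0.0]],  # each rank's momentum shard
)
```

Under ZeRO-2 the `avg_grad` list would itself be sharded (a reduce-scatter instead of an all-reduce), which saves memory but changes how gradient communication can overlap with the backward pass.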

Why It Matters

Resolving this optimization mystery could make training massive AI models like DeepSeek V3 faster and cheaper.