[D] How could ZeRO-1 be faster than ZeRO-2?
Empirical results show ZeRO-1 outperforming ZeRO-2 in distributed training, defying conventional wisdom.
Deep Dive
Microsoft's ZeRO (Zero Redundancy Optimizer) reduces memory by partitioning training state across data-parallel workers: ZeRO-1 shards only the optimizer states, while ZeRO-2 additionally shards the gradients. In theory both stages require the same communication volume, so ZeRO-2 should be a free memory win, yet practitioners keep finding ZeRO-1 faster. DeepSeek V3's training used ZeRO-1 rather than ZeRO-2, and Hugging Face's Ultra-Scale Playbook likewise reports ZeRO-1 as the faster option. The paradox: why does keeping full gradients on every worker (ZeRO-1) beat sharding them (ZeRO-2) when both shard the optimizer states?
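To make the memory trade-off concrete, here is a minimal sketch of the per-GPU memory accounting from the ZeRO paper for mixed-precision Adam training: 2 bytes/param for fp16 weights, 2 for fp16 gradients, and K = 12 bytes/param of fp32 optimizer state (master weights, momentum, variance). The function name and the 7B-parameter, 64-GPU example are illustrative, not from the source.

```python
def per_gpu_memory_gb(num_params: float, n_gpus: int, k: int = 12):
    """Per-GPU training memory (GB) under the ZeRO paper's accounting.

    2 bytes/param fp16 weights + 2 bytes/param fp16 grads +
    k bytes/param fp32 Adam state (master weights, momentum, variance).
    """
    psi, gb = num_params, 1024 ** 3
    baseline = (2 + 2 + k) * psi / gb                       # nothing sharded
    zero1 = (2 + 2) * psi / gb + k * psi / (n_gpus * gb)    # shard optimizer states
    zero2 = 2 * psi / gb + (2 + k) * psi / (n_gpus * gb)    # also shard gradients
    return baseline, zero1, zero2

# Hypothetical 7B-parameter model on 64 GPUs.
for name, mem in zip(["baseline", "ZeRO-1", "ZeRO-2"],
                     per_gpu_memory_gb(7e9, 64)):
    print(f"{name}: {mem:.1f} GB/GPU")
```

The gap between ZeRO-1 and ZeRO-2 is only the replicated fp16 gradients (2 bytes/param), which is small next to the optimizer-state savings both stages share. That is why giving up ZeRO-2's extra sharding can be an acceptable price if it buys back any throughput at all.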
Why It Matters
Resolving this optimization mystery could make training massive AI models like DeepSeek V3 faster and cheaper.