AMD's ROCm 7.2 still fails with NaN errors on custom PyTorch research code
The RX 7900 XTX hits NaNs on the backward pass while an RTX 3090 runs the same code fine.
A Reddit user (QuantumQuokka) reported a frustrating experience with AMD's ROCm platform for machine learning research. After procuring a reference RX 7900 XTX, they ported a small codebase for training flow matching models (SANA architecture) from their existing RTX 3090 setup. The environment used PyTorch 2.12 with ROCm 7.2, and the code was identical on both machines; only the pip environment differed. Forward passes executed without issues, but calling backward() immediately produced NaNs. Switching between bf16 and fp32 and tweaking various environment variables made no difference. Notably, the standard nanoGPT training script ran perfectly, which the user took as a sign that ROCm is validated mainly against well-established benchmarks.
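One way to narrow down a failure like this is to dump per-parameter gradients on the working machine and on the failing one, then compare them offline. The following is a minimal sketch of that idea, not the user's actual debugging procedure. The function name `bad_gradients`, the parameter names, and the sample values are all hypothetical, and the gradients are assumed to have been exported as plain lists of floats (in PyTorch, for example, via `p.grad.flatten().tolist()` for each entry of `model.named_parameters()`).

```python
import math


def bad_gradients(reference, candidate, rel_tol=1e-2, abs_tol=1e-5):
    """Compare two gradient dumps (dicts of name -> list of floats).

    Returns a list of (name, reason) pairs for every parameter whose
    candidate gradient is missing, has a different length, contains
    NaN/Inf, or differs from the reference beyond the tolerances.
    """
    problems = []
    for name, ref_values in reference.items():
        cand_values = candidate.get(name)
        if cand_values is None:
            problems.append((name, "missing"))
        elif len(cand_values) != len(ref_values):
            problems.append((name, "length mismatch"))
        elif any(not math.isfinite(v) for v in cand_values):
            problems.append((name, "non-finite"))
        elif any(
            not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
            for r, c in zip(ref_values, cand_values)
        ):
            problems.append((name, "mismatch"))
    return problems


# Hypothetical dumps: the first from the working GPU, the second from the failing one.
cuda_grads = {"proj.weight": [0.10, -0.20], "attn.weight": [0.05, 0.01]}
rocm_grads = {"proj.weight": [0.10, -0.20], "attn.weight": [float("nan"), 0.01]}

print(bad_gradients(cuda_grads, rocm_grads))
```

The tolerances here are loose placeholders; results from different GPUs are rarely bit-identical, so the useful signal is which parameters go non-finite or diverge sharply, not small numerical differences.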
This experience underscores a persistent gap: AMD's ROCm stack still lacks the robustness of Nvidia's CUDA for experimental, non-standard research workloads. The user speculates that ROCm passes validation on popular repos but breaks on even slightly uncommon custom operations. For the AI research community, this means AMD GPUs remain a risky alternative for frontier work despite competitive hardware specs. Until AMD invests in broader regression testing and backward-pass stability, researchers building novel architectures will likely continue to rely on Nvidia's ecosystem.
- Researcher tested an RX 7900 XTX with ROCm 7.2 and PyTorch 2.12, running code identical to a working RTX 3090 setup.
- Forward passes succeeded, but backward passes produced NaNs on a custom flow matching model (SANA) regardless of precision (bf16 or fp32).
- The standard nanoGPT training script ran without problems; the user concludes that ROCm is fragile on uncommon research code.
Why It Matters
AMD's ROCm remains unreliable for experimental ML research, limiting GPU alternatives to Nvidia's CUDA ecosystem.