Developer Tools

trunk/90896d47b1ab5ef4cf64e1318f31d7e8c26e8648: trunc_normal_ low precision fix (#174997)

A single-precision (fp32) bug in PyTorch's `trunc_normal_` function created 1000σ outliers that could silently destroy training stability.

Deep Dive

The PyTorch team has resolved a critical numerical precision bug in its `torch.nn.init.trunc_normal_` function (Pull Request #174997), which is widely used for initializing neural network weights. The issue stemmed from how the function handled low-precision floating-point calculations when generating truncated normal distributions. The function used an inverse-CDF approach: evaluate the normal CDF at the truncation bounds, draw uniformly between those values, and map the draw back through `erfinv` (the inverse error function). With parameters like std=0.002 and bounds [-2, 2], the bounds sit 1000σ from the mean, so the CDF values underflow to exactly 0 or saturate to exactly 1 in fp32. `erfinv` then produced near-infinite values for inputs at or very close to ±1, and the resulting extreme outliers were clamped to the boundary values. Those clamped values were 1000-standard-deviation outliers that dramatically skewed the tensor's statistical properties while appearing harmless in quantile analysis.
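A minimal sketch of that inverse-CDF pipeline in fp32 (mirroring the pre-fix approach; the `norm_cdf` helper, variable names, and sample count are illustrative) shows where the collapse happens:

```python
import math
import torch

mean, std, a, b = 0.0, 0.002, -2.0, 2.0

def norm_cdf(x):
    # Standard normal CDF expressed via the error function.
    return (1.0 + math.erf(x / math.sqrt(2.0))) / 2.0

# The bounds sit 1000 sigma from the mean, so the CDF saturates:
l = norm_cdf((a - mean) / std)    # norm_cdf(-1000.0) -> exactly 0.0
u = norm_cdf((b - mean) / std)    # norm_cdf(+1000.0) -> exactly 1.0

t = torch.empty(67_000_000, dtype=torch.float32)
t.uniform_(2 * l - 1, 2 * u - 1)  # uniform on [-1, 1); can hit -1 exactly
t.erfinv_()                       # erfinv(-1) = -inf; inputs within an
                                  # ulp of +/-1 also explode
t.mul_(std * math.sqrt(2.0)).add_(mean)
t.clamp_(a, b)                    # +/-inf silently becomes +/-2.0,
                                  # i.e. a 1000-sigma outlier

print((t.abs() == b).sum())       # expect a handful of clamped extremes
```

With tens of millions of draws and fp32 granularity on the order of 1e-7 near ±1, a few samples landing on (or within an ulp of) the degenerate endpoints is expected rather than rare, consistent with the two clamped elements observed in the 67M-sample test described below.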

The technical impact was severe: in a test with 67 million samples, just two clamped elements at ±2.0 caused the kurtosis (the standardized fourth moment) to spike to 28,122 instead of the expected ~3.0, roughly a 10,000x inflation. The standard deviation was likewise inflated, from 0.002000 to 0.002029. For deep learning practitioners, this meant weight initialization could introduce destructive noise that compromised training stability, particularly in sensitive architectures where variance preservation is crucial. The fix replaces the problematic `erfinv` implementation with rejection sampling using CUDA's native Box-Muller normal generator (via `curand`): candidates are drawn from the untruncated normal and redrawn if they land outside the bounds, so nothing is ever clamped. Because rejection sampling never evaluates `erfinv` near the distribution boundaries, it avoids the precision collapse entirely, ensuring proper statistical properties and eliminating the catastrophic outliers that could silently degrade model performance.
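The kernel change itself is CUDA, but the rejection-sampling idea can be sketched in a few lines of Python (a CPU stand-in that uses `Tensor.normal_` where the real fix uses curand's Box-Muller generator; the function name is mine):

```python
import torch

def trunc_normal_rejection(shape, mean=0.0, std=0.002, a=-2.0, b=2.0):
    """Sketch: draw plain normal samples, redraw any outside [a, b].

    There is no clamping step, so no probability mass can pile up
    at the boundary values.
    """
    out = torch.empty(shape).normal_(mean, std)
    bad = (out < a) | (out > b)
    while bad.any():
        out[bad] = torch.empty(int(bad.sum())).normal_(mean, std)
        bad = (out < a) | (out > b)
    return out

w = trunc_normal_rejection((1024, 1024))
```

With the bounds 1000σ out, essentially every draw is accepted on the first pass, so this costs about the same as plain normal sampling; rejection only becomes expensive when the truncation interval covers little probability mass.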

Key Points
  • The `trunc_normal_` function's low-precision `erfinv` implementation created 1000σ outliers that were clamped to boundary values
  • Just 2 extreme values in 67M samples inflated kurtosis from ~3.0 to 28,122 (roughly 10,000x) and distorted the variance; the sketch after this list reproduces the effect
  • Fix switches to rejection sampling with CUDA's Box-Muller generator, eliminating precision issues near distribution boundaries
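
The kurtosis figure is easy to sanity-check. This back-of-envelope snippet (my own check, not the PR's test; it allocates roughly 0.5 GB) plants two ±2.0 outliers among 67 million well-behaved samples and recomputes the moments:

```python
import torch

n = 67_000_000
x = torch.empty(n, dtype=torch.float64).normal_(0.0, 0.002)
x[0], x[1] = 2.0, -2.0             # the two clamped 1000-sigma outliers

std = x.std()                      # ~0.002030 instead of 0.002000
z = (x - x.mean()) / std
kurt = (z ** 4).mean()             # ~2.8e4 instead of ~3.0
print(f"std={std:.6f}  kurtosis={kurt:.0f}")
```

Two elements out of 67 million are invisible to any quantile-based check, yet they dominate the fourth moment, which is why the bug could pass casual inspection.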

Why It Matters

Weight initialization bugs can silently destroy training stability in neural networks, making this fix critical for reproducible deep learning research.