Research & Papers

Subcritical Signal Propagation at Initialization in Normalization-Free Transformers

New mathematical framework reveals why normalization-free transformers like DyT are roughly 10x more sensitive to initialization than standard LayerNorm models.

Deep Dive

A new theoretical paper by researcher Sergey Alekseev provides a mathematical framework for understanding why certain transformer architectures fail during training. The study introduces the Averaged Partial Jacobian Norm (APJN) as a measure of gradient amplification across layers and extends the analysis to transformers with bidirectional attention. The core finding is a clear 'criticality' picture: standard pre-LayerNorm transformers show controlled, power-law growth of the APJN with depth, while normalization-free variants that replace normalization with elementwise nonlinearities such as tanh exhibit problematic stretched-exponential growth.
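For readers who want the metric itself: prior work on partial Jacobians defines the APJN roughly as below. This is a sketch following that earlier convention, not the paper's exact definition; in particular, the 1/N_l width normalization and the exponents are assumptions.

```latex
% Partial Jacobian of hidden state h^{(l)} with respect to h^{(l_0)}:
J^{(l_0, l)}_{ij} = \frac{\partial h^{(l)}_i}{\partial h^{(l_0)}_j}
% APJN: squared Frobenius norm of the partial Jacobian, averaged over
% random initializations \theta and normalized by the width N_l:
\mathcal{J}^{(l_0, l)} = \mathbb{E}_{\theta}\!\left[ \frac{1}{N_l} \sum_{i,j} \left( J^{(l_0, l)}_{ij} \right)^{2} \right]
% Schematic form of the criticality picture described above
% (\zeta, \beta, and c are illustrative placeholders):
\mathcal{J}^{(0, l)} \sim l^{\zeta} \;\; \text{(pre-LN: power law)}
\qquad \text{vs.} \qquad
\mathcal{J}^{(0, l)} \sim e^{c\, l^{\beta}} \;\; \text{(DyT/Derf: stretched exponential)}
```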

This stretched-exponential behavior classifies Dynamic Tanh (DyT) and Dynamic erf (Derf) transformers as 'subcritical.' The theory mathematically explains the empirical observation that these models are notoriously sensitive to initialization and optimization hyperparameters, often requiring extensive tuning to train stably. Validated against deep vision transformers, the analysis gives architects a predictive tool for spotting training instability before running costly experiments, bridging the gap between the empirical engineering of transformers and rigorous theoretical analysis of their training dynamics.
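For concreteness, here is a minimal PyTorch sketch of the layers in question. The DyT form, y = gamma * tanh(alpha * x) + beta with a learnable scalar alpha, follows the published Dynamic Tanh formulation; the Derf variant and the alpha_init default shown here are assumptions patterned on it.

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Normalization-free stand-in for LayerNorm: y = gamma * tanh(alpha * x) + beta.

    alpha is a learnable scalar steepness; gamma/beta are per-channel affine
    parameters, mirroring LayerNorm's affine transform.
    """
    def __init__(self, dim: int, alpha_init: float = 0.5):  # alpha_init value assumed
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

class DynamicErf(nn.Module):
    """Hypothetical Derf variant: same shape, with erf as the squashing function."""
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.erf(self.alpha * x) + self.beta
```

Because tanh and erf saturate, the scale of alpha (and of the incoming activations) at initialization directly controls how strongly each layer squashes its signal, which is where the hyperparameter sensitivity described above enters.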

Key Points
  • Introduces the APJN metric to analyze gradient flow in transformers, predicting training stability from initialization (a toy estimator is sketched after this list).
  • Identifies 'subcritical' behavior in normalization-free models (DyT/Derf) with stretched-exponential APJN growth.
  • Explains why these architectures are 10x more sensitive to hyperparameters than standard LayerNorm models.
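
As an illustration of the first point, the APJN can be estimated empirically without materializing any Jacobian. Everything below (the function name, the Hutchinson-style probing, the normalization by output width) is a toy sketch, not the paper's code.

```python
import torch

def apjn(blocks, x, l0=0, n_probes=8):
    """Monte-Carlo estimate of the APJN between h^(l0) and the network output.

    Uses the identity E_v ||J^T v||^2 = ||J||_F^2 for Gaussian probes v
    (Hutchinson-style), so only vector-Jacobian products are needed.
    Averaging over re-initialized networks would approximate E_theta.
    """
    h = x
    for block in blocks[:l0]:
        h = block(h)
    h = h.detach().requires_grad_(True)   # h^(l0): the differentiation point
    out = h
    for block in blocks[l0:]:
        out = block(out)
    total = 0.0
    for _ in range(n_probes):
        v = torch.randn_like(out)
        # Vector-Jacobian product g = J^T v via reverse-mode autodiff.
        (g,) = torch.autograd.grad((out * v).sum(), h, retain_graph=True)
        total += (g ** 2).sum().item()
    return total / (n_probes * out.numel())  # width normalization assumed

# Illustrative usage: track growth with depth for a plain tanh stack.
blocks = [torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.Tanh())
          for _ in range(24)]
print(apjn(blocks, torch.randn(256)))
```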

Why It Matters

Provides a predictive theory for AI engineers to build more stable models and avoid costly failed training runs.