Research & Papers

Information-Theoretic Generalization Bounds for Stochastic Gradient Descent with Predictable Virtual Noise

New paper adapts perturbation covariances to gradient history for provably tighter bounds

Deep Dive

A new theoretical paper by Mohammad Partohaghighi tackles a core limitation of information-theoretic generalization bounds for stochastic gradient descent (SGD). Existing virtual-perturbation analyses add auxiliary Gaussian noise only inside the proof to make mutual information tractable, but they require the perturbation covariances to be fixed independently of the optimization history. This restricts their ability to represent the geometries induced by evolving gradient statistics, preconditioners, and curvature proxies, which are common features of modern adaptive optimizers.
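As a rough sketch of the fixed-covariance construction such analyses rely on (the notation below is illustrative and assumed here, not quoted from the paper): a Gaussian perturbation with a pre-specified covariance is added to the SGD output only in the proof, and the generalization gap is controlled by an information term about the perturbed output plus a sensitivity term measuring how much the perturbation itself can move the loss.

\[
\widetilde{W}_T = W_T + \xi, \qquad \xi \sim \mathcal{N}(0, \Sigma),
\]
\[
\mathbb{E}\big[\mathrm{gen}(W_T)\big] \;\lesssim\; \sqrt{\frac{I(\widetilde{W}_T; S)}{n}} \;+\; \varepsilon_{\mathrm{sens}}(\Sigma),
\]

where S is the training sample of size n. The exact constants and the form of the sensitivity term \(\varepsilon_{\mathrm{sens}}\) vary across analyses; the key restriction is that \(\Sigma\) must be chosen in advance, independently of the realized gradient trajectory.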

The paper introduces predictable, history-adaptive virtual perturbations, where the perturbation covariance at each iteration may depend on the history of the real SGD iterates but not on current or future randomness. This predictability enables a conditional Gaussian relative-entropy argument, yielding bounds that replace the fixed sensitivity and gradient-deviation terms with conditional, history-adaptive counterparts. The resulting bound includes an output-sensitivity penalty driven by the accumulated perturbation covariance, and the gradient-deviation term reduces to a conditional variance only under an unbiasedness assumption. Crucially, the framework recovers fixed isotropic and geometry-aware bounds (e.g., deterministic or public covariance rules) as special cases, providing a unified theoretical lens for analyzing adaptive optimization without altering the actual training algorithm.
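Schematically, the predictability condition and the conditional argument it enables can be pictured as follows (again with assumed notation: \(g_t\) for the stochastic gradient at step t, \(\eta_t\) for the step size, \(\mathcal{F}_{t-1}\) for the realized history up to step t-1; the paper's exact construction may differ):

\[
\Sigma_t \in \mathcal{F}_{t-1}, \qquad
\widetilde{W}_t = \widetilde{W}_{t-1} - \eta_t\, g_t + \xi_t, \quad \xi_t \sim \mathcal{N}(0, \Sigma_t).
\]

Because each \(\Sigma_t\) is fixed by the past, one common way to organize the relative-entropy computation is the chain rule, which splits it into per-step conditional Gaussian KL terms,

\[
D_{\mathrm{KL}}\big(P_{\widetilde{W}_{1:T}\mid S} \,\big\|\, Q_{\widetilde{W}_{1:T}}\big)
= \sum_{t=1}^{T} \mathbb{E}\, D_{\mathrm{KL}}\big(P_{\widetilde{W}_t \mid \widetilde{W}_{1:t-1},\, S} \,\big\|\, Q_{\widetilde{W}_t \mid \widetilde{W}_{1:t-1}}\big),
\]

each of which involves Gaussians whose covariance \(\Sigma_t\) is measurable with respect to the history, keeping the computation in closed form.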

Key Points
  • Perturbation covariance can depend on past SGD history but not on future or current randomness, maintaining tractability
  • New bounds replace fixed sensitivity terms with conditional adaptive counterparts and include a covariance-comparison cost, a Gaussian KL divergence (see the formula after this list)
  • Recovers existing fixed-noise bounds (isotropic, geometry-aware) under admissible synchronization rules like deterministic or public covariances
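
For concreteness, a covariance-comparison cost of this kind has the standard closed form of a KL divergence between two zero-mean Gaussians; whether the paper compares the adaptive covariance \(\Sigma_t\) against a fixed reference \(\Sigma_0\) in exactly this way is an assumption here, but this is the usual shape such a term takes in dimension d:

\[
D_{\mathrm{KL}}\big(\mathcal{N}(0, \Sigma_t) \,\big\|\, \mathcal{N}(0, \Sigma_0)\big)
= \tfrac{1}{2}\Big(\operatorname{tr}\big(\Sigma_0^{-1} \Sigma_t\big) - d + \ln\frac{\det \Sigma_0}{\det \Sigma_t}\Big).
\]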

Why It Matters

Tighter theoretical guarantees for adaptive optimizers (e.g., Adam) used in large-scale training, without modifying the algorithm