New Theory Explains Gradient Descent Spikes in Bilevel Optimization and Adversarial Training

Transient amplification in coupled gradient descent can exceed spectral radius predictions by 2x or more.

Deep Dive

A new theoretical paper from Ahanaf Hasan Ariq, accepted as a poster at the HiLD 2026 workshop (co-located with ICML 2026), tackles a persistent blind spot in optimization theory: transient amplification in coupled gradient descent. In systems like bilevel optimization and adversarial training, one parameter block's update depends on another, which makes the Jacobian of the joint update block-triangular. Asymptotic stability is governed by the spectral radii of the diagonal blocks, but the iterates can still exhibit arbitrarily large transient spikes before converging, a phenomenon that standard spectral analysis cannot see. The paper provides rigorous pseudospectral bounds for this behavior.
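To make the transient effect concrete, here is a minimal sketch (not taken from the paper) that tracks the norm of the powers of a toy 2x2 block-triangular matrix. The diagonal value a = 0.9 and the coupling c = 5.0 are hypothetical values chosen for illustration; the spectral radius is below 1, yet the norm of J^k grows well above 1 before it decays.

```python
import numpy as np

def transient_norms(J, steps):
    """Return the spectral norms ||J^k||_2 for k = 0..steps."""
    norms = []
    P = np.eye(J.shape[0])
    for _ in range(steps + 1):
        norms.append(np.linalg.norm(P, 2))
        P = J @ P
    return np.array(norms)

# Hypothetical toy values: diagonal entries a (stable) and coupling c.
a, c = 0.9, 5.0
J = np.array([[a, 0.0],
              [c, a]])  # block lower-triangular with 1x1 blocks

rho = max(abs(np.linalg.eigvals(J)))   # spectral radius, set by the diagonal
norms = transient_norms(J, 200)
peak_step = int(np.argmax(norms))

print(f"spectral radius: {rho:.3f}")
print(f"peak ||J^k||: {norms.max():.2f} at k = {peak_step}")
print(f"||J^200||: {norms[-1]:.2e}")
```

The spectral radius alone predicts monotone decay at rate 0.9 per step; the peak in the printed norms is the transient spike that the coupling term produces.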

The core result is a sharp bound on the Kreiss constant: K(J) ≤ 2/(1-γ) + ||C||/(4(1-γ)), which holds when the diagonal blocks are symmetric with spectral radius below 1. This yields a finite-horizon iteration complexity of O(K(J)^2 log(1/δ)) for stochastic coupled descent, a non-asymptotic guarantee that depends on the problem instance. The work also characterizes the critical coupling threshold for spectral instability and extends the analysis via Neumann-series perturbation. Experiments on linear-quadratic problems and neural network training confirm the theory, offering practical guidance for designing stable training loops in meta-learning and robust ML.
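The bound can be sanity-checked numerically on a toy case. The sketch below is an illustration, not the paper's code. It uses the standard definition of the Kreiss constant, K(J) = sup over |z| > 1 of (|z| - 1) ||(zI - J)^{-1}||, and estimates it from below on a finite grid. It assumes that γ is an upper bound on the spectral radii of the diagonal blocks and that C is the off-diagonal coupling block; the summary does not define either symbol, and the values a = 0.9 and c = 5.0 are hypothetical.

```python
import numpy as np

def kreiss_lower_estimate(J, radii, n_angles=64):
    """Grid estimate (a lower bound) of sup_{|z|>1} (|z|-1) * ||(zI - J)^{-1}||_2."""
    n = J.shape[0]
    I = np.eye(n)
    best = 0.0
    for r in radii:
        for theta in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
            z = r * np.exp(1j * theta)
            resolvent_norm = np.linalg.norm(np.linalg.inv(z * I - J), 2)
            best = max(best, (r - 1.0) * resolvent_norm)
    return best

# Hypothetical toy instance: symmetric (1x1) diagonal blocks, coupling block C = [c].
a, c = 0.9, 5.0
J = np.array([[a, 0.0],
              [c, a]])

gamma = a        # assumption: gamma bounds the diagonal blocks' spectral radii
norm_C = abs(c)  # assumption: ||C|| is the norm of the coupling block

radii = 1.0 + np.logspace(-3, 1, 200)   # sample |z| from just outside 1 up to 11
K_est = kreiss_lower_estimate(J, radii)
bound = 2.0 / (1.0 - gamma) + norm_C / (4.0 * (1.0 - gamma))

# Kreiss matrix theorem: K(J) <= sup_k ||J^k||, here approximated over 200 steps.
P, power_sup = np.eye(2), 0.0
for _ in range(201):
    power_sup = max(power_sup, np.linalg.norm(P, 2))
    P = J @ P

print(f"grid estimate of K(J): {K_est:.2f}")
print(f"stated upper bound:    {bound:.2f}")
print(f"sup_k ||J^k||:         {power_sup:.2f}")
```

Because the grid only samples finitely many points, the estimate can undershoot the true Kreiss constant, so it is a consistency check against the stated bound rather than a verification of it.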

Key Points
  • Kreiss constant bound K(J) ≤ 2/(1-γ) + ||C||/(4(1-γ)) quantifies worst-case transient amplification for block-triangular Jacobians
  • Finite-horizon iteration complexity O(K(J)^2 log(1/δ)) for stochastic coupled descent, tighter than traditional spectral radius bounds
  • Critical coupling threshold identified for spectral instability, validated on linear-quadratic problems and neural network training

Why It Matters

The results give engineers a way to predict and mitigate training instability in meta-learning, adversarial training, and two-time-scale optimization.