Provable Adversarial Robustness in In-Context Learning
Theoretical breakthrough shows model robustness scales with √(capacity), revealing fundamental limits of in-context learning.
A new theoretical paper by researcher Di Zhang establishes provable guarantees for adversarial robustness in in-context learning (ICL), addressing a critical gap in understanding how large language models perform under distribution shifts. The work introduces a distributionally robust meta-learning framework that provides worst-case performance guarantees for ICL under Wasserstein-based distribution shifts, moving beyond the common assumption that test tasks come from distributions similar to pretraining data.
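For reference, a distributionally robust objective of this kind can be written as a worst-case risk over nearby task distributions. The formulation below is a generic sketch with assumed notation (P for the pretraining task distribution, W for the Wasserstein distance, L(θ; τ) for the ICL loss on task τ), not the paper's exact statement:

```latex
% Generic Wasserstein-DRO meta-learning objective (notation assumed, not from the paper):
%   theta          - model parameters
%   P              - pretraining task distribution
%   rho            - Wasserstein perturbation budget
%   L(theta; tau)  - in-context learning loss on task tau given N examples
\min_{\theta} \; \sup_{Q \,:\, W(Q,\, P) \,\le\, \rho} \;
  \mathbb{E}_{\tau \sim Q}\!\left[ \mathcal{L}(\theta;\, \tau) \right]
```

The inner supremum ranges over every task distribution within Wasserstein radius ρ of the pretraining distribution, which is what turns an average-case guarantee into a worst-case one.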
Focusing on linear self-attention Transformers, the analysis derives non-asymptotic bounds linking adversarial perturbation strength (ρ), model capacity (m), and the number of in-context examples (N). The key result is that the maximum tolerable perturbation scales with the square root of model capacity (ρ_max ∝ √m), so larger models inherently offer better protection against adversarial inputs. At the same time, adversarial settings impose a sample-complexity penalty proportional to the square of the perturbation magnitude (N_ρ - N_0 ∝ ρ²), quantifying how many additional in-context examples are needed for reliable performance under attack.
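To make the analyzed model class concrete, the sketch below implements a single linear self-attention layer (attention without the softmax) reading a prompt of N in-context (x, y) pairs plus a query token. It is a minimal generic version of this architecture; the weight names W_KQ and W_PV and the exact prompt layout are assumptions rather than the paper's parameterization.

```python
import numpy as np

def linear_self_attention_icl(X, y, x_query, W_KQ, W_PV):
    """Predict the query label with one linear self-attention layer.

    X: (N, d) context inputs, y: (N,) context labels, x_query: (d,) query input.
    W_KQ, W_PV: (d+1, d+1) trainable weights. A generic sketch of the
    linear-attention ICL setup, not the paper's exact parameterization.
    """
    # Prompt matrix: each row is a token [x_i, y_i]; the query's label slot is 0.
    Z = np.vstack([np.column_stack([X, y]),
                   np.concatenate([x_query, [0.0]])])
    # Linear attention: raw inner-product scores (no softmax), scaled by 1/N.
    scores = (Z @ W_KQ @ Z.T) / X.shape[0]
    out = scores @ Z @ W_PV
    # The prediction is read off the label coordinate of the query token.
    return out[-1, -1]

# Tiny usage example with random, untrained weights (shape check only).
rng = np.random.default_rng(0)
d, N = 4, 16
X, w = rng.normal(size=(N, d)), rng.normal(size=d)
print(linear_self_attention_icl(X, X @ w, rng.normal(size=d),
                                rng.normal(size=(d + 1, d + 1)),
                                rng.normal(size=(d + 1, d + 1))))
```

Meta-training such a layer across many sampled tasks is the setting in which the bounds tie the context length N and the capacity m to the tolerable perturbation ρ.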
Experiments on synthetic tasks confirm these scaling laws, providing empirical validation of the theoretical framework. The work advances the theoretical understanding of ICL's limits under adversarial conditions and positions model capacity as a fundamental resource for distributional robustness. The findings have implications for how large language models are designed, evaluated, and deployed in real-world applications where distribution shifts are inevitable.
- Proves the maximum tolerable perturbation scales with √(capacity) (ρ_max ∝ √m), showing larger models are inherently more robust
- Quantifies the adversarial sample-complexity penalty as ρ² (N_ρ - N_0 ∝ ρ²), revealing how many extra in-context examples are needed under attack (see the sketch after this list)
- Introduces distributionally robust meta-learning framework providing worst-case guarantees for in-context learning under Wasserstein shifts
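The two scaling relations can be illustrated numerically, as noted above. The constants c1, c2, and N_0 below are hypothetical placeholders; the paper's non-asymptotic bounds are what pin them down.

```python
import numpy as np

# Hypothetical proportionality constants; the paper's bounds determine these exactly.
c1, c2, N_0 = 0.5, 2.0, 64

# rho_max ∝ sqrt(m): quadrupling capacity doubles the tolerable perturbation.
for m in [64, 256, 1024, 4096]:
    print(f"capacity m={m:5d} -> rho_max ~ {c1 * np.sqrt(m):6.1f}")

# N_rho - N_0 ∝ rho^2: doubling the perturbation quadruples the extra examples needed.
for rho in [0.5, 1.0, 2.0, 4.0]:
    print(f"perturbation rho={rho:3.1f} -> examples needed N_rho ~ {N_0 + c2 * rho**2:6.1f}")
```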
Why It Matters
Provides a theoretical foundation for building reliable AI systems that maintain performance under real-world distribution shifts and adversarial conditions.