Solving adversarial examples requires solving exponential misalignment
New research reveals why adversarial examples persist: AI's concept spaces are orders of magnitude larger than ours.
A team from Stanford University, including Alessandro Salvatore, Stanislav Fort, and Surya Ganguli, has published a paper titled 'Solving adversarial examples requires solving exponential misalignment.' The research introduces a geometric framework for understanding why AI models remain vulnerable to adversarial attacks: subtle input perturbations that fool models but not humans. Its key construct is a network's 'perceptual manifold' (PM) for a concept, defined as the set of all inputs the model confidently assigns to that class. Strikingly, the authors found that the dimensionality of these machine PMs is orders of magnitude higher than the dimensionality of natural human concepts.
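To make the PM idea concrete, here is a minimal sketch, assuming a PyTorch classifier that returns logits, of how PM membership and a crude local dimension estimate might be probed. The confidence threshold `tau`, the function names, and the random-direction probe are illustrative assumptions, not the paper's actual estimator.

```python
import torch

def in_pm(model, x, cls, tau=0.9):
    """Crude PM membership test: does the model assign input x to class
    `cls` with confidence at least tau? (tau is an illustrative threshold;
    the paper's exact confidence criterion may differ.)"""
    with torch.no_grad():
        probs = torch.softmax(model(x.unsqueeze(0)), dim=-1)[0]
    return probs[cls].item() >= tau

def local_pm_dimension_estimate(model, x, cls, eps=0.05, n_dirs=256, tau=0.9):
    """Rough local dimension probe: from a point x on the PM, step a small
    distance eps along random directions and count how many keep x inside
    the PM. The fraction of such 'flat' directions, scaled by the ambient
    dimension, gives a crude local dimension estimate. Purely illustrative,
    not the paper's estimator."""
    d = x.numel()
    flat = 0
    for _ in range(n_dirs):
        v = torch.randn_like(x)
        v = eps * v / v.norm()  # random unit direction, scaled to eps
        if in_pm(model, x + v, cls, tau) and in_pm(model, x - v, cls, tau):
            flat += 1
    return d * flat / n_dirs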
This high-dimensional geometry creates an 'exponential misalignment': because volume grows exponentially with dimension, there are exponentially many inputs that machines confidently classify but that humans would reject as nonsense. The paper posits this as the core origin of adversarial examples. A network's PMs fill such a vast region of input space that any input lies very close to some class's PM, making misleading perturbations easy to find.

The team tested this hypothesis across 18 neural networks with varying robust accuracy, and the predictions held: both robust accuracy and the distance to a PM were negatively correlated with the PM's dimension. Crucially, even today's most robust models still exhibit this exponential misalignment; only the few PMs whose dimensionality approaches that of human concepts showed true alignment with human perception.
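As an illustration of what 'distance to a PM' can mean operationally, the sketch below uses a simple gradient-ascent procedure (my assumption; the paper may use a different attack or metric) to estimate how far an input must move before the model confidently assigns it to a chosen class. Under the exponential-misalignment picture, this distance is small for nearly any input and target class.

```python
import torch

def distance_to_pm(model, x, target_cls, tau=0.9, lr=0.01, steps=500):
    """Gradient-ascent upper bound on the L2 distance from x to the target
    class's PM: grow a perturbation delta until the model's confidence in
    `target_cls` crosses tau, then report ||delta||. Consistently small
    returned distances for arbitrary (x, target_cls) pairs are the
    signature of exponential misalignment. Sketch only."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        probs = torch.softmax(model((x + delta).unsqueeze(0)), dim=-1)[0]
        if probs[target_cls].item() >= tau:
            break  # the perturbed input has entered the target PM
        loss = -torch.log(probs[target_cls])  # push target confidence up
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta.detach().norm().item()
```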
This work fundamentally connects two critical fields in AI safety: technical alignment (making models do what we intend) and adversarial robustness (making models reliable against attacks). It suggests that the 'curse of dimensionality' in machine perception is a major, previously underappreciated roadblock. The findings imply that patching specific vulnerabilities may be insufficient; achieving truly robust AI may require a foundational redesign to align the geometric structure of machine learning with human cognition, moving beyond just optimizing for accuracy on a dataset.
- Defined 'Perceptual Manifolds' (PMs): The set of all inputs a network confidently assigns to a class, revealing that machine concepts are orders of magnitude higher-dimensional than human ones.
- Found Exponential Misalignment: The high dimensionality leads to exponentially many nonsensical inputs that machines confidently classify, explaining the persistent existence of adversarial examples.
- Tested on 18 Networks: Confirmed that both robust accuracy and distance to a PM are negatively correlated with PM dimension; even state-of-the-art robust models remain exponentially misaligned from human perception (a sketch of this correlation check follows the list).
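For concreteness, the correlation check described above might look like the following sketch, given one PM-dimension, robust-accuracy, and distance measurement per network. The function and variable names are hypothetical, and the paper's actual statistical methodology may differ.

```python
import numpy as np
from scipy.stats import spearmanr

def misalignment_correlations(pm_dims, robust_accs, pm_distances):
    """Given one measurement per network (e.g. the 18 evaluated models),
    test whether robust accuracy and distance-to-PM decrease as PM
    dimension increases, i.e. whether both rank correlations are negative.
    Returns (rho, p-value) pairs for each relationship."""
    pm_dims = np.asarray(pm_dims)
    rho_acc, p_acc = spearmanr(pm_dims, robust_accs)
    rho_dist, p_dist = spearmanr(pm_dims, pm_distances)
    return {"dim_vs_robust_accuracy": (rho_acc, p_acc),
            "dim_vs_distance_to_pm": (rho_dist, p_dist)}
```

The paper's reported result corresponds to both returned rho values being negative: higher-dimensional PMs go with lower robust accuracy and with shorter distances to the nearest PM.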
Why It Matters
This reveals a fundamental geometric flaw in current AI, suggesting robust, trustworthy systems require aligning machine 'concept spaces' with human cognition, not just patching bugs.