Convergent Abstraction Hypothesis: AI learns like convergent evolution
Sharks and dolphins evolved similar shapes; AI models may converge on identical concepts.
The Convergent Abstraction Hypothesis, introduced by Jan_Kulveit on LessWrong, suggests that cognitive systems—whether biological brains or AI models—independently evolve similar abstractions when trained in comparable environments and under analogous selection pressures. Drawing from convergent evolution (e.g., sharks and dolphins sharing streamlined bodies due to hydrodynamics), Kulveit argues that abstractions like 'objectness' or arithmetic emerge repeatedly because they offer extreme compression: an apple can be represented from ~10^25 degrees of freedom down to a handful of variables like position and momentum. This compression is a universal pressure—from physics to SGD-trained neural networks—and many mathematical structures (PCA, Fourier transforms) serve the same efficient-coding role.
However, Kulveit stresses that this convergence is contingent, not inevitable. Just as squids evolved differently from sharks due to lacking a vertebral column, changes in architecture, training data distribution, or optimization regime could disrupt convergence. The hypothesis offers a middle ground: empirical evidence of convergence (e.g., vision models learning edge detectors or word embeddings clustering similar meanings) is real and useful for alignment, but practitioners should not assume these abstractions are universally natural or stable. This matters for safety, as robust alignment may need to account for fragility when models encounter distribution shifts or different objectives.
- Convergence driven by universal pressure to compress: an apple's 10^25 degrees of freedom reduced to a few variables like position.
- Shared physical environment (e.g., translation invariance) forces similar abstractions across species and AI systems.
- Unlike 'natural abstractions', this hypothesis acknowledges fragility: changes in architecture or training can break convergence.
Why It Matters
For AI alignment, it warns that learned abstractions may be contingent, not universal, requiring careful validation across different models.