Genetic programming replaces layer normalization in Vision Transformers, recovering 84.25% ImageNet accuracy without full retraining
No full retraining needed: layer-specific scalar functions replace normalization in ViTs
Deploying Vision Transformers (ViTs) on edge devices has been hampered by the computational cost of layer normalization, which creates a global reduction bottleneck. Recent work replaced normalization with homogeneous scalar approximations, but a single shared function fits different layers poorly, and the approach required expensive retraining. In a new paper, Kieran Carrigg and colleagues propose a genetic programming (GP) framework that evolves layer-specific scalar functions directly from pre-trained weights. Their post-training re-alignment strategy adapts each layer individually, eliminating the need for full model retraining.
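To make the bottleneck concrete, here is a minimal NumPy sketch contrasting layer normalization with an element-wise scalar function. The tanh-shaped `scalar_approx` and its `alpha` parameter are hypothetical stand-ins chosen for illustration, not expressions from the paper, and the tensor sizes are toy values.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Needs the mean and variance over the whole feature axis, so every
    # output element depends on every input element of the same token.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def scalar_approx(x, alpha=0.5):
    # Hypothetical element-wise stand-in (not from the paper): each output
    # depends only on its own input, so no reduction is required.
    return np.tanh(alpha * x)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))  # 4 tokens, 8 features (toy sizes)

# Perturb a single feature of a single token.
perturbed = tokens.copy()
perturbed[0, 0] += 1.0

# Count how many output elements change under each function.
ln_changed = int(np.sum(~np.isclose(layer_norm(tokens), layer_norm(perturbed))))
sc_changed = int(np.sum(~np.isclose(scalar_approx(tokens), scalar_approx(perturbed))))
print("layer norm outputs changed:", ln_changed)
print("scalar function outputs changed:", sc_changed)
```

Changing one input alters only one output of the scalar function, while layer normalization propagates the change across the token's features. That coupling is what forces hardware to finish a reduction over the full feature vector before it can emit any normalized value.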
Results show the evolved expressions capture 91.6% of the target normalization variance (R²), versus just 70.2% for one-size-fits-all baselines. The modified ViT recovers 84.25% Top-1 accuracy on ImageNet-1K in only 20 epochs, preserving performance while removing the global reduction bottleneck. The result is a favorable trade-off between arithmetic complexity and off-chip memory traffic, which removes a key barrier to efficient ViT inference on edge accelerators such as mobile GPUs and FPGAs.
- Genetic programming evolves heterogeneous scalar functions for each ViT layer, directly from pre-trained weights and without full retraining
- Captures 91.6% of normalization variance (R²) vs 70.2% for homogeneous approximations
- Recovers 84.25% Top-1 ImageNet-1K accuracy in only 20 epochs while eliminating the global reduction bottleneck
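For readers unfamiliar with the R² figures above, the sketch below shows the standard coefficient of determination, which measures how much of a target signal's variance an approximation explains. The source does not describe the paper's exact evaluation protocol, so the function name `r_squared` and the toy arrays are assumptions for illustration only.

```python
import numpy as np

def r_squared(target, approx):
    # Standard coefficient of determination: 1 - SS_res / SS_tot.
    target = np.asarray(target, dtype=float).ravel()
    approx = np.asarray(approx, dtype=float).ravel()
    ss_res = np.sum((target - approx) ** 2)         # unexplained variation
    ss_tot = np.sum((target - target.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Toy stand-in for a normalization layer's output (not real model data).
target = np.array([-1.0, 0.0, 1.0, 2.0])
perfect = target.copy()                         # exact reproduction
constant = np.full_like(target, target.mean())  # predicts only the mean

print(r_squared(target, perfect))   # 1.0
print(r_squared(target, constant))  # 0.0
```

On this scale, a score of 0.916 means the approximation leaves 8.4% of the target's variance unexplained, compared with 29.8% at 0.702.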
Why It Matters
Removing the normalization bottleneck lets Vision Transformers run more efficiently on edge accelerators, a step toward real-time computer vision on mobile and IoT devices.