Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
New activation family outperforms GELU on GPT-2, ResNet-56, and BERT-small.
Eylon Krause's new paper introduces Geometric Monomial (GEM), a family of activation functions that achieves ReLU-like performance using purely rational arithmetic and C^{2N} smoothness. The family has three variants: GEM (base), E-GEM (epsilon-parameterized, giving arbitrarily close L^p approximation of ReLU), and SE-GEM (a piecewise variant that eliminates dead neurons). An ablation over N found N=1 optimal for standard-depth networks, shrinking the accuracy gap to GELU on CIFAR-100 + ResNet-56 from 6.10% to 2.12%. The smoothness parameter exposes a CNN-transformer tradeoff: N=1 works best for deep CNNs, N=2 for transformers.
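The summary does not give the exact GEM formulas, but the ingredients it names (purely rational arithmetic, an epsilon knob, a smoothness order N, and nonzero gradients on the negative side) can be illustrated with a minimal PyTorch sketch. The module below is a hypothetical piecewise-rational activation assumed for illustration, not the paper's definition: its negative branch epsilon·x / (epsilon + x^{2N}) agrees with the identity to order 2N at the origin, so the two pieces join with C^{2N} smoothness.

```python
import torch
import torch.nn as nn


class RationalSmoothReLU(nn.Module):
    """Illustrative epsilon-parameterized, piecewise-rational activation.

    NOTE: the exact GEM / E-GEM / SE-GEM formulas are not given in this
    summary; this module only demonstrates the ingredients described there
    (rational arithmetic, an epsilon knob, C^{2N} smoothness at the origin,
    and a nonzero gradient for negative inputs).
    """

    def __init__(self, N: int = 1, eps: float = 1e-4):
        super().__init__()
        self.N = N      # smoothness order: the pieces join C^{2N}-smoothly at x = 0
        self.eps = eps  # smaller eps -> tighter approximation of ReLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Negative branch: eps * x / (eps + x^{2N}) matches the identity to
        # order 2N at 0 and decays toward 0 as x -> -inf while keeping a
        # nonzero gradient, so neurons never go fully dead.
        neg = self.eps * x / (self.eps + x.pow(2 * self.N))
        return torch.where(x >= 0, x, neg)
```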
On the reported benchmarks, GEM variants match or beat GELU. SE-GEM (epsilon=10^{-4}) edged out GELU on CIFAR-10 + ResNet-56 (92.51% vs. 92.44%), GEM achieved the lowest perplexity on GPT-2 (124M) (72.57 vs. 73.76 for GELU), and E-GEM (epsilon=10) reached the best validation loss on BERT-small (6.656). The epsilon parameterization shows a scale-dependent optimum: small epsilon suits deep CNNs and large transformers, while large epsilon suits small transformers such as BERT-small because of their limited depth and unconstrained gradients.
- The GEM family combines purely rational arithmetic with C^{2N} smoothness and outperforms GELU on GPT-2 (perplexity 72.57 vs. 73.76).
- SE-GEM achieves 92.51% accuracy on CIFAR-10 + ResNet-56, beating GELU's 92.44%.
- Optimal smoothness N=1 for CNNs, N=2 for transformers; epsilon parameterization enables scale-dependent tuning.
Why It Matters
GEM offers a drop-in replacement for GELU that delivers competitive or better accuracy across CNNs and transformers, using only rational arithmetic.
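To make the drop-in claim concrete, here is a hedged usage sketch that swaps GELU for the illustrative activation above in a toy transformer-style MLP block. The module name, layer sizes, and the N=2 / epsilon=10 choice (the regime the summary associates with small transformers) are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Toy transformer-style MLP block (dimensions are arbitrary); the only change
# from a standard block is replacing nn.GELU() with the sketch defined above.
mlp = nn.Sequential(
    nn.Linear(256, 1024),
    RationalSmoothReLU(N=2, eps=10.0),  # illustrative: N=2 / large-eps regime for small transformers
    nn.Linear(1024, 256),
)

x = torch.randn(8, 256)
print(mlp(x).shape)  # -> torch.Size([8, 256])
```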