ActTail: Global Activation Sparsity in Large Language Models
New research from Wenwen Hou et al. uses heavy-tail theory to skip up to 80% of activation computations in LLMs while improving accuracy over uniform sparsity baselines.
A team of researchers led by Wenwen Hou has introduced ActTail, a novel method for implementing global activation sparsity in large language models (LLMs). Activation sparsity is a technique to speed up inference by skipping computations on less important neural activations. The key innovation of ActTail is that it moves beyond a one-size-fits-all approach. Instead, it analyzes the unique statistical properties—specifically the 'heavy-tail exponent'—of each projection layer within a Transformer model. This allows the system to intelligently allocate higher sparsity budgets to layers that can handle it, minimizing performance loss.
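The basic mechanism being allocated here, skipping low-magnitude activations under a per-layer budget, can be sketched as follows. The threshold-based gating and the specific budget values are illustrative assumptions for this sketch, not the paper's exact procedure:

```python
import numpy as np

def sparsify_activations(x, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of activations.

    x: 1-D activation vector for one projection layer.
    sparsity: fraction of entries to skip (e.g. 0.8 keeps the top 20%).
    """
    k = int(len(x) * (1.0 - sparsity))  # number of entries to keep
    if k == 0:
        return np.zeros_like(x)
    # Threshold at the k-th largest magnitude; everything below is skipped.
    thresh = np.partition(np.abs(x), -k)[-k]
    return np.where(np.abs(x) >= thresh, x, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)

# Non-uniform per-layer budgets (made-up values for illustration):
# the idea is that heavier-tailed layers tolerate more sparsity.
budgets = {"q_proj": 0.85, "k_proj": 0.85, "gate_proj": 0.75, "down_proj": 0.70}
for name, s in budgets.items():
    xs = sparsify_activations(x, s)
    print(name, f"realized sparsity = {np.mean(xs == 0.0):.2f}")
```

In a real inference stack the skipped entries would never be computed at all (saving the corresponding matrix-multiply work), rather than being zeroed after the fact as this toy version does.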
The method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, providing a principled mathematical framework for sparsity allocation rather than relying on heuristics. The paper establishes a direct theoretical relationship between a projection's heavy-tail exponent and its optimal sparsity ratio. In practical tests on models such as LLaMA-2 and Mistral-7B, ActTail delivered dramatic improvements. At an aggressive 80% sparsity level (meaning 80% of activation computations are skipped), it not only maintained performance but actually reduced perplexity (a measure of predictive uncertainty, where lower is better) by up to 40.1% on LLaMA-2-13B compared to uniform sparsity methods.
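In HT-SR theory, the heavy-tail exponent is typically a power-law exponent fit to the tail of a layer's weight eigenvalue spectrum. A minimal sketch of estimating such an exponent with the standard power-law MLE (Hill-style) estimator is below; the final mapping from exponent to sparsity budget is a made-up placeholder, since the paper derives its own relationship:

```python
import numpy as np

def heavy_tail_exponent(W, tail_frac=0.1):
    """Power-law MLE (Hill-style) estimate of the tail exponent of the
    eigenvalue spectrum of the correlation matrix W^T W / n."""
    eigs = np.linalg.eigvalsh(W.T @ W / W.shape[0])
    eigs = np.sort(eigs)[::-1]              # descending eigenvalues
    k = max(2, int(len(eigs) * tail_frac))  # size of the tail sample
    tail = eigs[:k]
    # MLE for p(x) ~ x^(-alpha) on samples >= tail[-1]:
    #   alpha = 1 + k / sum(log(lambda_i / lambda_k))
    return 1.0 + k / np.sum(np.log(tail / tail[-1]))

rng = np.random.default_rng(1)
W = rng.standard_normal((1024, 512))  # stand-in for one projection matrix
alpha = heavy_tail_exponent(W)

# Hypothetical allocation rule: heavier tails (smaller alpha) get more
# sparsity. This linear clamp only illustrates the shape of the
# allocation step, not the paper's actual alpha-to-sparsity formula.
sparsity = float(np.clip(1.2 - 0.1 * alpha, 0.5, 0.9))
print(f"alpha = {alpha:.2f}, sparsity budget = {sparsity:.2f}")
```

Running this per projection layer yields the non-uniform, statistics-driven budgets the article describes, in place of a single global sparsity knob.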
This breakthrough is significant because it tackles a major bottleneck in deploying LLMs: computational cost and latency during inference. By enabling models to run significantly faster with less memory movement and potentially lower power consumption, ActTail paves the way for more efficient deployment on edge devices or in cost-sensitive cloud environments. The performance gains at high sparsity levels suggest this method could be a key component in the next generation of efficient, high-performance AI models.
- Uses Heavy-Tailed Self-Regularization theory to assign custom, non-uniform sparsity budgets to each neural projection layer.
- Achieved a 40.1% reduction in perplexity on LLaMA-2-13B at 80% global activation sparsity versus uniform methods.
- Provides a theoretical framework linking sparsity ratio to layer statistics, moving beyond heuristic design for model optimization.
Why It Matters
Enables faster, cheaper LLM inference with higher accuracy, critical for real-time applications and scaling AI services.