LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs
Selectively dropping softmax attention in low-sensitivity layers boosts throughput by up to 68%.
LayerBoost, a new method from researchers at Ghent University and partners, tackles the quadratic bottleneck of softmax attention in transformer LLMs by applying a layer-aware reduction strategy. Rather than uniformly replacing attention across all layers—which often degrades quality—LayerBoost first performs a systematic sensitivity analysis on a pretrained model. This identifies which layers are critical for maintaining performance. For highly sensitive layers, standard softmax attention is retained; for moderately sensitive layers, it's replaced with linear sliding window attention; and for low-sensitivity layers, attention is removed entirely. A lightweight distillation-based healing phase, requiring only 10 million additional training tokens, recovers any performance lost from the architectural changes.
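The paper describes this replacement policy at a high level; the sketch below is a minimal illustration, not the authors' code, of how a per-layer sensitivity score could be mapped to one of the three attention variants. The threshold values and function names are hypothetical, chosen only to make the idea concrete.

```python
# Illustrative sketch: choose an attention variant per layer based on a
# sensitivity score, e.g. the quality drop observed when that layer's
# attention is ablated or simplified on a held-out set.
from enum import Enum

class AttnVariant(Enum):
    SOFTMAX = "softmax"          # keep full quadratic attention
    SLIDING_WINDOW = "sliding"   # linear-cost local attention
    NONE = "none"                # remove the attention sub-block entirely

def assign_variants(sensitivity, high_thresh=0.5, low_thresh=0.1):
    """Map each layer's sensitivity score to an attention variant.

    `sensitivity` holds one score per layer; the thresholds are
    hypothetical values used here purely for illustration.
    """
    plan = []
    for score in sensitivity:
        if score >= high_thresh:
            plan.append(AttnVariant.SOFTMAX)
        elif score >= low_thresh:
            plan.append(AttnVariant.SLIDING_WINDOW)
        else:
            plan.append(AttnVariant.NONE)
    return plan

# Example: a 6-layer model where the early layers are the most sensitive.
print(assign_variants([0.9, 0.7, 0.3, 0.2, 0.05, 0.02]))
```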
In testing, LayerBoost reduced inference latency and improved throughput by up to 68% under high concurrency, while maintaining competitive model quality. It matched base model performance on several benchmarks and showed only minor degradations on others, significantly outperforming state-of-the-art attention linearization methods. These efficiency gains make LayerBoost especially well-suited for high-concurrency serving and hardware-constrained deployment scenarios, where inference cost and memory footprint are critical bottlenecks. The paper is available on arXiv (2604.22050).
- LayerBoost selectively retains softmax attention in critical layers, uses linear sliding window attention in moderate layers, and removes attention in low-sensitivity layers.
- Achieves up to 68% throughput improvement at high concurrency with minimal quality degradation.
- Requires only 10M additional training tokens for a lightweight distillation-based recovery phase.
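To make the recovery phase concrete, the sketch below shows a standard logit-distillation step of the kind such a healing phase could use, with the original model acting as a frozen teacher for the modified student. This is an assumption-laden illustration in PyTorch; the model, optimizer, and temperature are placeholders, and the paper's exact loss and training schedule may differ.

```python
# Minimal sketch of one distillation "healing" step: a frozen teacher
# (the original model) supervises the student (the modified model).
# All objects and the temperature value are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def healing_step(student, teacher, batch, optimizer, temperature=2.0):
    """One knowledge-distillation step matching student logits to teacher logits."""
    with torch.no_grad():
        teacher_logits = teacher(batch)           # [batch, seq, vocab]
    student_logits = student(batch)

    # KL divergence between temperature-softened output distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```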
Why It Matters
Makes LLMs dramatically cheaper and faster to serve, enabling deployment on resource-constrained hardware with little to no loss in accuracy.