HELLoRA cuts 84% of LoRA parameters for MoE models while boosting accuracy

arXiv cs.LG May 20, 2026

⚡Uses only 15.7% of LoRA's parameters yet improves accuracy by 9.2% on OlMoE.

Deep Dive

Low-Rank Adaptation (LoRA) has become the go-to method for parameter-efficient fine-tuning of large language models, but most variants are designed for dense architectures. Mixture-of-Experts (MoE) models scale total parameters while keeping per-token compute nearly constant, and their sparse activation patterns present an opportunity: only a subset of experts is used for any given input. Prior approaches apply LoRA uniformly across all layers, spending compute and memory on experts that rarely activate.
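To make the sparse-activation point concrete, here is a minimal sketch of top-k routing that counts how often each expert is selected. The router is a stand-in (a fixed bias plus random noise), and the function name, the `skew` parameter, and all sizes are assumptions for illustration, not details from the paper.

```python
import random
from collections import Counter


def count_expert_activations(num_tokens, num_experts, top_k, skew=1.5, seed=0):
    """Simulate top-k routing and count how often each expert is selected."""
    rng = random.Random(seed)
    # Hypothetical fixed router bias: lower-indexed experts are favored.
    bias = [-skew * e / num_experts for e in range(num_experts)]
    counts = Counter()
    for _ in range(num_tokens):
        # Router logits = bias + per-token noise (stand-in for a learned router).
        logits = [b + rng.gauss(0.0, 1.0) for b in bias]
        ranked = sorted(range(num_experts), key=lambda e: logits[e], reverse=True)
        counts.update(ranked[:top_k])  # each token activates only top_k experts
    return counts


counts = count_expert_activations(num_tokens=2000, num_experts=8, top_k=2)
total = sum(counts.values())
for expert in range(8):
    print(f"expert {expert}: {counts[expert] / total:.1%} of activations")
```

In this toy setup a few experts absorb most of the routing decisions while others are rarely chosen, which is the kind of imbalance a uniform adapter placement ignores.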

HELLoRA (Hot-Experts Layer-level Low-Rank Adaptation) flips this by attaching LoRA modules exclusively to the most frequently activated experts in each layer. The paper tests this across three MoE backbones (OlMoE-1B-7B, Mixtral-8x7B, and DeepSeekMoE) and three task families: mathematical reasoning, code generation, and safety alignment. The results are striking: on OlMoE, HELLoRA uses only 15.7% of vanilla LoRA's trainable parameters, reduces adapter FLOPs by 38.7%, boosts training throughput by 1.9x, and improves accuracy by 9.2%. On DeepSeekMoE, it outperforms LoRA with just 23.2% of its parameters. The authors attribute this to structured regularization that preserves pretrained expert specialization. Combined with LoRI to form HELLoRI, the method targets extreme parameter budgets by freezing up-projections and sparsifying down-projections.
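The selection step can be sketched as follows, assuming per-layer activation counts have already been collected. This is an illustrative reading of "hot experts", not the authors' implementation: the helper names, the counts, the number of hot experts per layer, and the single-projection parameter count are all hypothetical.

```python
def select_hot_experts(activation_counts, num_hot):
    """Return, per layer, the indices of the num_hot most activated experts."""
    hot = {}
    for layer, counts in activation_counts.items():
        ranked = sorted(range(len(counts)), key=lambda e: counts[e], reverse=True)
        hot[layer] = sorted(ranked[:num_hot])
    return hot


def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical activation counts: 2 layers, 8 experts each.
activation_counts = {
    0: [910, 120, 640, 75, 300, 55, 820, 80],
    1: [60, 700, 90, 880, 150, 40, 95, 985],
}

hot = select_hot_experts(activation_counts, num_hot=2)
print(hot)  # {0: [0, 6], 1: [3, 7]}

# Compare adapter parameters: every expert vs. hot experts only.
# Simplification: one adapted projection per expert, with assumed sizes.
per_expert = lora_params(d_in=2048, d_out=1024, rank=8)
all_experts = sum(len(c) for c in activation_counts.values()) * per_expert
hot_only = sum(len(ids) for ids in hot.values()) * per_expert
fraction = hot_only / all_experts
print(f"{fraction:.1%}")  # 25.0%
```

The 25% here follows only from adapting 2 of 8 experts in the toy setup. The 15.7% and 23.2% figures reported for OlMoE and DeepSeekMoE depend on the real architectures and the paper's selection settings, which this sketch does not reproduce.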

Key Points

HELLoRA attaches LoRA adapters only to the most frequently activated experts in each layer, cutting adapter FLOPs by 38.7% on OlMoE.
On DeepSeekMoE, HELLoRA outperforms standard LoRA while using only 23.2% of its trainable parameters.
Training throughput increases 1.9x on OlMoE without sacrificing accuracy; in fact, accuracy improves by 9.2%.

Why It Matters

HELLoRA makes fine-tuning large MoE models much cheaper, which matters for adapting and deploying models such as Mixtral and DeepSeekMoE at scale.
