Research & Papers

GQLA: New attention mechanism adapts LLM decoding to different GPU hardware

No retraining needed – GQLA lets LLMs run efficiently on both H100 and H20 GPUs.

Deep Dive

Fanxu Meng's new paper introduces Group-Query Latent Attention (GQLA), a clever twist on DeepSeek-V2/V3's Multi-head Latent Attention (MLA). While MLA achieves near-perfect roofline performance on H100-class GPUs, its single decoding path – an absorbed MQA form – ties efficient inference to specific compute-bandwidth ratios. This makes it suboptimal on commodity hardware like the export-restricted H20, and prevents gains from Multi-Token Prediction (MTP) or tensor parallelism along the head axis. GQLA solves this by building two algebraically equivalent decoding paths into a single set of trained weights: an MQA-absorb path identical to MLA, and a GQA path with a per-group expanded cache. The runtime simply selects the best path for the target hardware – no retraining, no custom kernels needed.
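The algebraic equivalence behind the two paths can be illustrated with a toy example. The sketch below is not the paper's implementation. It only shows the generic weight-absorption identity used by MLA-style attention: a score computed against an expanded key, q · (W_uk c), equals the score of an absorbed query against the cached latent, (W_uk^T q) · c. The matrix name `W_uk`, the tiny dimensions, and the helper functions are illustrative assumptions.

```python
# Toy demonstration of weight absorption (illustrative, not the paper's code).
# c    : latent vector cached per token (dimension r)
# W_uk : hypothetical up-projection from the latent to one head's key (d x r)
# q    : one head's query vector (dimension d)

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    """Return the transpose of matrix m."""
    return [list(col) for col in zip(*m)]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

W_uk = [[1, 2, 0],
        [0, -1, 3]]   # d = 2, r = 3
c = [2, 1, -1]        # cached latent
q = [3, 4]            # query

# Expanded path: rebuild the key from the latent, then score.
k = matvec(W_uk, c)
score_expanded = dot(q, k)

# Absorbed path: fold W_uk into the query, then score against the latent.
q_absorbed = matvec(transpose(W_uk), q)
score_absorbed = dot(q_absorbed, c)

print(score_expanded, score_absorbed)
```

Both paths produce the same attention score. They differ only in what is cached (the latent versus the expanded keys) and in where the multiplications happen, which is the trade-off a runtime can exploit.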

On LLaMA-3-8B, GQLA's MQA-absorb path compresses the per-token KV cache to 28.125% of the baseline GQA model's cache, while the GQA path preserves the memory-traffic pattern needed for efficient inference on GPUs with a lower compute-to-bandwidth ratio, such as the H20. The GQA path also supports up to 8-way zero-redundancy tensor parallelism. To avoid pretraining from scratch, Meng extends TransMLA into TransGQLA, a conversion technique that transforms a pretrained GQA checkpoint into a GQLA model, which makes GQLA practical for existing LLMs like LLaMA-3.
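The 28.125% figure can be checked with a few lines of arithmetic. The sketch below uses LLaMA-3-8B's public configuration (8 KV heads, head dimension 128). It also assumes, although this summary does not state it, an MLA-style cache of one 512-dimension latent plus one 64-dimension shared RoPE key, as in DeepSeek-V2. Under those assumptions the ratio matches the reported number.

```python
# Per-token, per-layer KV cache size, counted in elements (not bytes).
# LLaMA-3-8B public config: 8 KV heads, head dimension 128.
NUM_KV_HEADS = 8
HEAD_DIM = 128

# ASSUMPTION: latent layout borrowed from DeepSeek-V2's MLA
# (512-dim compressed KV latent + 64-dim shared RoPE key).
LATENT_DIM = 512
ROPE_DIM = 64

def gqa_cache_elems(num_kv_heads: int, head_dim: int) -> int:
    """Keys plus values for every KV head."""
    return 2 * num_kv_heads * head_dim

def latent_cache_elems(latent_dim: int, rope_dim: int) -> int:
    """One shared latent plus one shared RoPE key."""
    return latent_dim + rope_dim

baseline = gqa_cache_elems(NUM_KV_HEADS, HEAD_DIM)
absorbed = latent_cache_elems(LATENT_DIM, ROPE_DIM)
ratio = absorbed / baseline

print(f"{absorbed} / {baseline} = {ratio:.3%}")
```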

The key insight is that a single set of trained weights can adapt to diverse hardware without architectural changes or additional training. For AI infrastructure teams, this means one model can ship to both high-end H100 clusters and constrained H20 deployments, with the runtime choosing the attention path that suits each target. The method also enables MTP on the GQA path, which can boost token generation speed on mid-range GPUs. GQLA is a pragmatic step toward hardware-adaptive LLM inference that doesn't sacrifice the efficiency gains of latent attention.
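One way to picture the runtime choice is a roofline-style check: estimate each path's decode-step time as the slower of its compute time and its memory time on the target GPU, then pick the faster path. The sketch below is a hypothetical heuristic, not the paper's selection rule. The hardware numbers are approximate public spec-sheet values, and the per-path FLOP and byte counts are illustrative placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class Gpu:
    name: str
    peak_tflops: float     # dense BF16 throughput, approximate
    bandwidth_tbps: float  # HBM bandwidth in TB/s, approximate

@dataclass(frozen=True)
class PathCost:
    name: str
    flops: float       # FLOPs per decode step (placeholder)
    bytes_read: float  # cache bytes read per decode step (placeholder)

def step_time(gpu: Gpu, cost: PathCost) -> float:
    """Roofline estimate: a step is bound by compute or memory, whichever is slower."""
    compute_s = cost.flops / (gpu.peak_tflops * 1e12)
    memory_s = cost.bytes_read / (gpu.bandwidth_tbps * 1e12)
    return max(compute_s, memory_s)

def pick_path(gpu: Gpu, paths: Sequence[PathCost]) -> PathCost:
    """Choose the path with the lowest estimated step time on this GPU."""
    return min(paths, key=lambda p: step_time(gpu, p))

# Approximate spec-sheet figures (assumptions, not taken from the paper).
H100 = Gpu("H100", peak_tflops=989.0, bandwidth_tbps=3.35)
H20 = Gpu("H20", peak_tflops=148.0, bandwidth_tbps=4.0)

# ILLUSTRATIVE ONLY: the absorbed path reads a smaller cache but does more math;
# the GQA path reads a larger cache but does less math.
PATHS = [
    PathCost("mqa_absorb", flops=4e9, bytes_read=1e7),
    PathCost("gqa", flops=1e9, bytes_read=3e7),
]

for gpu in (H100, H20):
    print(gpu.name, "->", pick_path(gpu, PATHS).name)
```

With these placeholder costs, the compute-rich H100 favors the absorbed path and the H20 favors the GQA path, which mirrors the behavior the summary describes. A real runtime would need measured costs for its own batch size and sequence length.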

Key Points
  • GQLA exposes two equivalent decoding paths (MQA-absorb and GQA) from the same weights, allowing runtime hardware selection.
  • On LLaMA-3-8B, the MQA-absorb path compresses the per-token KV cache to 28.125% of the GQA baseline.
  • Supports up to 8-way zero-redundancy tensor parallelism on the GQA path, and a pretrained GQA checkpoint can be converted via TransGQLA instead of pretraining from scratch.

Why It Matters

Enables a single LLM to achieve peak performance across diverse GPU hardware without retraining or custom kernels.