HexiSeq boosts LLM training throughput by up to 1.72x on mixed GPU clusters
New system optimizes long-context training across H100s, A100s, and beyond.
Long-context training of large language models typically relies on context parallelism (CP) and head parallelism (HP) under the assumption of homogeneous GPU meshes. In practice, many production clusters contain a mix of GPU models (e.g., H100s and A100s) with varying memory, compute, and network bandwidth. HexiSeq, introduced by researchers Yan Liang, Youhe Jiang, and colleagues, addresses this gap by enabling fully asymmetric CP–HP partitioning. It assigns sequence shards and attention heads according to each device's capabilities, formulating the allocation as a constrained optimization problem solved by an efficient hierarchical scheduler.
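To make the idea of capability-aware partitioning concrete, here is a minimal sketch that splits a sequence and a set of attention heads across devices in proportion to an assumed relative speed. This is not HexiSeq's scheduler, which solves a constrained optimization problem that also accounts for memory and network bandwidth. The `Device` class, the `relative_speed` values, and the `proportional_split` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    relative_speed: float  # assumed, illustrative weight; not a measured figure


def proportional_split(total: int, weights: list[float]) -> list[int]:
    """Split `total` integer units in proportion to `weights`.

    Uses largest-remainder rounding so the shares always sum to `total`.
    """
    weight_sum = sum(weights)
    exact = [total * w / weight_sum for w in weights]
    shares = [int(x) for x in exact]  # floor of each exact share
    leftover = total - sum(shares)
    # Give the remaining units to the entries with the largest fractional parts.
    order = sorted(
        range(len(weights)),
        key=lambda i: exact[i] - shares[i],
        reverse=True,
    )
    for i in order[:leftover]:
        shares[i] += 1
    return shares


# Hypothetical mixed cluster: two faster and two slower GPUs.
devices = [
    Device("H100-0", 2.0),
    Device("H100-1", 2.0),
    Device("A100-0", 1.0),
    Device("A100-1", 1.0),
]
weights = [d.relative_speed for d in devices]

seq_shards = proportional_split(131072, weights)  # tokens per device
head_shards = proportional_split(32, weights)     # attention heads per device

for d, tokens, heads in zip(devices, seq_shards, head_shards):
    print(f"{d.name}: {tokens} tokens, {heads} heads")
```

A uniform split would give every device the same shard and leave the faster GPUs waiting on the slower ones. Weighting by capability is the simplest way to even out per-device step time, though a real scheduler would also have to respect each device's memory limit and the bandwidth between devices.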
Evaluated across models from 3B to 70B parameters and context lengths up to one million tokens, HexiSeq improves throughput on both real and simulated clusters. On mixed H100–A100 testbeds, it achieves an average 1.11x speedup (up to 1.19x). In simulated clusters with 32–128 GPUs spanning up to four GPU models, average throughput improves by 1.36x (up to 1.72x). In FLOP-comparable comparisons against homogeneous clusters, HexiSeq reaches throughput close to that of the strongest homogeneous baseline, indicating that heterogeneous hardware can be used for long-context LLM training with little performance penalty.
- Extends context and head parallelism to heterogeneous GPU clusters (mixed H100, A100, etc.) with non-uniform network bandwidth.
- Uses constrained optimization and a hierarchical scheduler to assign sequence shards and attention heads per device capabilities.
- Achieves up to 1.72x throughput improvement in simulations with 32–128 GPUs spanning up to four GPU models, and up to 1.19x on real H100–A100 testbeds.
Why It Matters
Enables efficient long-context LLM training on mixed hardware, reducing costs and unlocking older GPUs for modern workloads.