Research & Papers

Scaling State-Space Models on Multiple GPUs with Tensor Parallelism

A new technique overcomes the key bottleneck for deploying fast, long-context Mamba models in production.

Deep Dive

A team of researchers has published a pivotal paper, 'Scaling State-Space Models on Multiple GPUs with Tensor Parallelism,' which addresses a major engineering hurdle in deploying next-generation AI models. Selective state-space models (SSMs), such as Mamba, have emerged as a powerful and efficient alternative to Transformer architectures, particularly for handling long sequences. However, their inference speed has been bottlenecked by the memory and compute limits of a single GPU, and the tensor parallelism (TP) techniques developed for Transformers cannot be applied directly because of the SSM's recurrent structure. This new work presents a communication-efficient TP design tailored specifically to SSM blocks, enabling them to be split effectively across multiple accelerators for the first time.
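To see where the difficulty comes from, the toy sketch below shows a single step of a simplified diagonal selective-SSM recurrence, roughly the kind of per-token update a Mamba block performs. The names and shapes are illustrative assumptions, not the paper's code; the point is simply that the recurrent state must stay resident across every token, so any sharding scheme has to keep each GPU's slice of that state (and its updates) local rather than synchronizing it per step.

```python
import torch

def selective_ssm_step(h, x_t, A, B_t, C_t, delta_t):
    """One step of a simplified diagonal selective-SSM recurrence (illustrative only).

    h       : (channels, state)  running hidden state
    x_t     : (channels,)        input features for the current token
    A       : (channels, state)  state-transition parameters (negative in practice)
    B_t,C_t : (state,)           input-dependent projections for this step
    delta_t : (channels,)        input-dependent step sizes
    """
    # Discretize and update the state. Each channel's update touches only
    # that channel's slice of h, which a channel-wise shard across GPUs
    # would need to keep resident locally across every token.
    dA = torch.exp(delta_t[:, None] * A)            # (channels, state)
    dB = delta_t[:, None] * B_t[None, :]            # (channels, state)
    h = dA * h + dB * x_t[:, None]
    y_t = (h * C_t[None, :]).sum(-1)                # read out: (channels,)
    return h, y_t
```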

The technical innovation lies in three key design choices: a state cache that improves Time-To-First-Token (TTFT), a partitioning of the model's parameter tensors that keeps recurrent state updates local to each GPU, and a quantized AllReduce that slashes synchronization overhead. The team evaluated their method on SSM-based models including Mamba, Falcon-Mamba, and Zamba on NVIDIA A6000 and A100 clusters. Results show substantial throughput gains, with batch-request throughput improving by ~1.6-2.1x on 2 GPUs and a dramatic ~2.6-4.0x on 4 GPUs for pure Mamba models, and the benefits grow at longer context lengths. The quantized communication alone provided an additional 10-18% boost. This breakthrough paves the way for the practical, large-scale deployment of SSMs, making their superior efficiency on long-context tasks finally accessible in real-world applications.
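As a rough illustration of the third idea, the sketch below approximates a quantized AllReduce: each rank sends int8 payloads plus a per-rank scale instead of full-precision activations, and the sum is reconstructed locally. This is a stand-in using symmetric int8 quantization and torch.distributed.all_gather; the paper's actual quantization scheme and collective implementation may differ.

```python
import torch
import torch.distributed as dist

def quantized_all_reduce(x: torch.Tensor) -> torch.Tensor:
    """Sum a tensor across ranks while exchanging 1-byte payloads (sketch only)."""
    world = dist.get_world_size()

    # Per-rank symmetric quantization to int8, with one fp32 scale per rank.
    scale = (x.abs().max() / 127.0).clamp(min=1e-8).reshape(1)
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)

    # Exchange the quantized payloads and their scales.
    q_list = [torch.empty_like(q) for _ in range(world)]
    s_list = [torch.empty_like(scale) for _ in range(world)]
    dist.all_gather(q_list, q)
    dist.all_gather(s_list, scale)

    # Dequantize and accumulate locally to complete the reduction.
    out = torch.zeros_like(x, dtype=torch.float32)
    for qi, si in zip(q_list, s_list):
        out += qi.to(torch.float32) * si
    return out.to(x.dtype)
```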

Key Points
  • Enables 2.6-4.0x higher throughput for Mamba models when scaled across 4 GPUs, with the largest gains at long context lengths.
  • Solves the non-trivial challenge of applying tensor parallelism to SSMs by keeping recurrent state updates local and minimizing synchronization (see the sketch after this list).
  • Uses quantized AllReduce for communication, yielding a further 10-18% throughput improvement by reducing bandwidth overhead.
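One plausible way to realize that locality, consistent with the points above but not necessarily the paper's exact partitioning, is a Megatron-style layout: each GPU owns a channel slice of the SSM block and its recurrent state cache, so per-token state updates need no communication, and only the output projection's partial sums are combined with a single (optionally quantized) AllReduce per layer. The skeleton below is illustrative under those assumptions; parameter names and the elided scan are placeholders.

```python
import torch
import torch.distributed as dist

class ShardedSSMBlock(torch.nn.Module):
    """Illustrative channel-sharded SSM block: one channel slice per GPU.

    The recurrent state cache lives entirely on the local slice, so the
    per-token state update involves no cross-GPU traffic; the only
    synchronization is one AllReduce over the output projection's partial
    sums (the step a quantized AllReduce would target).
    """

    def __init__(self, d_model: int, d_inner: int, d_state: int):
        super().__init__()
        world = dist.get_world_size()
        self.local_channels = d_inner // world
        # Column-parallel input projection, row-parallel output projection.
        self.in_proj = torch.nn.Linear(d_model, self.local_channels, bias=False)
        self.out_proj = torch.nn.Linear(self.local_channels, d_model, bias=False)
        # Local slice of the recurrent state cache (placeholder shape).
        self.register_buffer("state", torch.zeros(self.local_channels, d_state))

    def step(self, x_t: torch.Tensor) -> torch.Tensor:
        u = self.in_proj(x_t)              # local channels only, no comms
        # ... selective state update of self.state would go here (all local) ...
        y_partial = self.out_proj(u)       # partial sum over this rank's channels
        dist.all_reduce(y_partial)         # single sync point per layer
        return y_partial
```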

Why It Matters

Unlocks production-scale deployment of efficient, long-context SSM models like Mamba, making them viable alternatives to costly Transformers.