Research & Papers

TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

Cutting tensor-parallel communication overhead by 47% with near-lossless accuracy...

Deep Dive

Researchers from multiple institutions have developed TACO (Tensor-parallel Adaptive COmmunication compression), an FP8-based framework that targets the communication bottleneck in large-scale tensor-parallel LLM training. The intermediate tensors exchanged in tensor parallelism exhibit dense, near-zero distributions, so naive low-precision compression accumulates error across the frequent communication rounds. TACO counters this with a data-driven reshaping strategy combined with an Adaptive Scale-Hadamard Transform to enable high-fidelity FP8 quantization, while its Dual-Scale Quantization mechanism keeps values numerically stable throughout training.
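To make the rotate-then-quantize idea concrete, here is a minimal PyTorch sketch (requires a recent PyTorch with torch.float8_e4m3fn). It is an illustration of the general technique, not TACO's operator: the paper's Adaptive Scale-Hadamard Transform and Dual-Scale Quantization are fused GPU kernels with data-driven parameters, and the plain Hadamard rotation, block size, and function names below are our own assumptions.

```python
# Sketch of rotate-then-quantize FP8 compression (illustrative only).
import torch

def hadamard_matrix(n: int, device=None) -> torch.Tensor:
    """n x n orthonormal Hadamard matrix (n must be a power of two)."""
    H = torch.ones(1, 1, device=device)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / n ** 0.5                                # H @ H.T == I

def quantize_fp8(x: torch.Tensor, block: int = 64):
    """Rotate each block, then quantize it with its own FP8 scale.

    The orthonormal rotation flattens the dense, near-zero distribution
    across the block, so a single scale per block loses less information.
    """
    H = hadamard_matrix(block, device=x.device)
    xr = x.reshape(-1, block) @ H                      # per-block rotation
    scale = xr.abs().amax(dim=1, keepdim=True) / 448.0 # e4m3 max value
    scale = scale.clamp(min=1e-12)                     # avoid divide-by-zero
    q = (xr / scale).to(torch.float8_e4m3fn)           # 1-byte payload
    return q, scale

def dequantize_fp8(q, scale, block: int = 64, shape=None):
    """Undo quantization and apply the inverse (transposed) rotation."""
    H = hadamard_matrix(block, device=scale.device)
    xr = q.to(torch.float32) * scale
    return (xr @ H.T).reshape(shape)

x = torch.randn(4, 1024) * 0.01                        # dense, near-zero tensor
q, s = quantize_fp8(x)
x_hat = dequantize_fp8(q, s, shape=x.shape)
print((x - x_hat).abs().max())                         # small reconstruction error
```

Because the rotation is orthonormal it is exactly invertible in full precision; its benefit is purely statistical, giving each block a flatter distribution that fits FP8's dynamic range better.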

A second key innovation is a highly fused compression operator that sharply reduces memory traffic and kernel-launch overhead, letting compression overlap efficiently with communication. The framework composes with existing state-of-the-art data- and pipeline-parallel methods to form a compression-enabled 3D-parallel training system. In experiments on GPT and Qwen models, TACO demonstrated up to 1.87x end-to-end throughput improvement while maintaining near-lossless accuracy. Accepted at HPDC'26, the framework represents a practical path to scaling LLM training to larger clusters without sacrificing model quality.
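As a rough sketch of how a compression-enabled collective can slot into a tensor-parallel step, the snippet below wraps a standard torch.distributed all-gather: each rank quantizes locally, ships the FP8 bytes plus scales, and dequantizes peers' shards. It reuses quantize_fp8/dequantize_fp8 from the sketch above, assumes an initialized process group (e.g., NCCL), and omits the kernel fusion and compute/communication overlap that give TACO its speed; compressed_all_gather is a hypothetical name, not an API from the paper.

```python
import torch
import torch.distributed as dist

def compressed_all_gather(x: torch.Tensor, block: int = 64) -> torch.Tensor:
    """All-gather x across ranks, sending FP8 payloads plus scales instead
    of full-precision values (NCCL has no FP8 type, so the quantized
    values travel as raw uint8 bytes)."""
    q, scale = quantize_fp8(x, block)                  # from the sketch above
    world = dist.get_world_size()
    payload = q.view(torch.uint8).flatten().contiguous()
    scales = scale.flatten().contiguous()
    out_q = torch.empty(world * payload.numel(), dtype=torch.uint8,
                        device=x.device)
    out_s = torch.empty(world * scales.numel(), dtype=scales.dtype,
                        device=x.device)
    dist.all_gather_into_tensor(out_q, payload)        # small FP8 payload
    dist.all_gather_into_tensor(out_s, scales)         # per-block scales
    shards = []
    for r in range(world):                             # dequantize per rank
        qr = out_q[r * payload.numel():(r + 1) * payload.numel()]
        sr = out_s[r * scales.numel():(r + 1) * scales.numel()]
        shards.append(dequantize_fp8(
            qr.view(torch.float8_e4m3fn).reshape(-1, block),
            sr.reshape(-1, 1), block, shape=x.shape))
    return torch.cat(shards, dim=0)                    # gathered full tensor
```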

Key Points
  • TACO uses FP8 quantization with an Adaptive Scale-Hadamard Transform for high-fidelity compression of intermediate tensors (a byte-count estimate follows this list)
  • Achieves up to 1.87x end-to-end throughput improvement on GPT and Qwen models
  • Integrates with data and pipeline parallelism to create a compression-enabled 3D-parallel training framework
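For intuition about the payload savings, here is a back-of-envelope estimate with our own assumptions (BF16 baseline, 64-element blocks, one FP32 scale per block); these are not figures from the paper:

```python
# Rough communication-volume estimate; assumptions are ours, not the paper's.
elems = 4096 * 8192                          # one intermediate tensor
bf16_bytes = elems * 2                       # 2 bytes per BF16 element
fp8_bytes = elems + (elems // 64) * 4        # 1-byte values + FP32 scales
print(f"{bf16_bytes / fp8_bytes:.2f}x fewer bytes")   # ~1.88x on the wire
```

Lower wire volume translates into throughput only insofar as communication sits on the critical path, which is why TACO also overlaps its fused compression with the collective itself.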

Why It Matters

TACO slashes communication overhead in large-scale LLM training, enabling faster iterations and larger models with near-lossless accuracy.