Meta’s ScaleAcross Explorer speeds up AI training across data centers by 64%
New optimizer handles hundreds of thousands of GPUs across multiple buildings.
Get AI news that actually matters
One email a day. Zero fluff. Join 10,000+ professionals.
Meta has open-sourced ScaleAcross Explorer, a novel optimizer designed to tackle the immense communication challenges of training large language models across multiple data center buildings and geographic regions—a paradigm the company calls “scale-across” training. As frontier models scale to hundreds of thousands of GPUs spread across dozens of buildings, the system design space has become incredibly complex, involving heterogeneous hardware, new model architectures, and evolving communication patterns. The research team, drawing from Meta's production experience, conducted in-depth characterization of three critical design dimensions: parallelism placement (how model and data are split across GPUs), parallelism scheduling (when and where communication happens), and network layer technologies (e.g., RoCE, InfiniBand).
ScaleAcross Explorer then holistically optimizes all three dimensions together, exploring trade-offs that traditional approaches miss. In testbed experiments and large-scale simulations, the optimizer delivered up to 64.62% training speedups over Meta's current production configurations and up to 37.59% speedups over the state-of-the-art baseline across a wide range of design points. The paper (28 pages, 27 figures) provides a detailed framework for the AI infrastructure community. This work is especially timely as training clusters move beyond single data centers, and companies like Meta, Google, and OpenAI race to build models requiring exascale compute. The tool is available on arXiv and could significantly reduce both training time and carbon footprint for next-generation AI systems.
- ScaleAcross Explorer achieves up to 64.62% training speedups over Meta's production configurations and 37.59% over state-of-the-art baselines.
- The optimizer jointly optimizes parallelism placement, parallelism scheduling, and network layer technologies across hundreds of thousands of GPUs.
- The research provides practical guidance for deploying large-scale training jobs across multiple data center buildings and regions.
Why It Matters
Cutting multi-datacenter training time by over 60% accelerates frontier AI development and slashes infrastructure costs.