ModTrans: Translating Real-World Models for a Distributed Training Simulator
A new translator eliminates the barrier between ML model developers and distributed systems researchers.
Distributed training is critical for developing today's massive AI models like GPT-4 and Claude 3, but experimenting with different hardware configurations is prohibitively expensive. Simulators like ASTRA-sim exist to model this process, but they've had a fundamental flaw: they couldn't accept real, production-ready model architectures as input, forcing researchers to use simplified, artificial examples. This created a disconnect between the machine learning experts building the models and the systems researchers optimizing the training infrastructure.
Researcher Yi Lyu's new tool, ModTrans, directly solves this problem. It acts as a translator, converting models developed in real-world frameworks (like PyTorch or TensorFlow) into the specific input format required by the ASTRA-sim simulator. The paper reports that the translation step adds negligible overhead, so researchers can take an existing model, whether a vision transformer or a large language model, and immediately simulate how it would perform across thousands of GPUs under different network topologies. This bridges a critical gap, allowing for more accurate and practical co-design of future AI models and the supercomputers that train them.
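To make the workflow concrete, here is a minimal sketch assuming ModTrans consumes a recorded PyTorch execution trace. The trace-capture API (`torch.profiler.ExecutionTraceObserver`) is standard PyTorch; the final `modtrans.convert` call is hypothetical, since the paper's actual interface is not shown here.

```python
# Sketch: capture a PyTorch execution trace as the kind of real-model
# input a translator like ModTrans could consume.
import torch
import torch.nn as nn
from torch.profiler import ExecutionTraceObserver

# A small stand-in model; in practice this would be a real
# production architecture (e.g., a vision transformer or an LLM).
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
batch = torch.randn(32, 1024)

# Record one full training step; the resulting JSON describes the
# model's actual operator graph, not a hand-written synthetic workload.
observer = ExecutionTraceObserver()
observer.register_callback("pytorch_et.json")
observer.start()

loss = model(batch).square().mean()
loss.backward()
optimizer.step()

observer.stop()
observer.unregister_callback()

# Hypothetical translation step into ASTRA-sim's input format
# (illustrative only; not the paper's documented API):
# modtrans.convert("pytorch_et.json", out="workload.et")
```

The key point is that the simulator's input comes from a trace of the real model rather than a simplified, artificial example.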
- ModTrans translates real-world AI models into the input format used by the ASTRA-sim distributed training simulator.
- The translation process adds negligible computational cost, according to the paper's experimental results.
- It removes a major barrier, allowing ML and systems researchers to collaborate on training optimization.
Why It Matters
Enables cheaper, faster simulation of large-scale AI training, accelerating the development of more efficient models and hardware systems.