Distributed Order Recording Techniques for Efficient Record-and-Replay of Multi-threaded Programs
New Distributed Clock and Epoch techniques slash synchronization overhead for deterministic replay.
Deep Dive
A team led by Xiang Fu and Kento Sato developed ReOMP, a record-and-replay tool for OpenMP programs. It introduces Distributed Clock (DC) and Distributed Epoch (DE) recording schemes to eliminate excessive thread synchronization. The tool is reported to be 2-5x more efficient than traditional methods. It also integrates with MPI replay tools such as ReMPI, which enables scalable debugging of complex HPC applications with minimal runtime overhead.
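The summary above does not spell out how the recording schemes work internally, so the following is only a minimal sketch of one plausible reading of a distributed-clock approach, not ReOMP's actual implementation. It assumes that each thread draws a value from a shared logical clock when it enters a guarded region, stores that value in its own per-thread log rather than in one shared log, and on replay waits until the clock reaches its next recorded value. The names `DistributedClockRecorder`, `DistributedClockReplayer`, and `execute` are hypothetical, and Python threads stand in for OpenMP threads.

```python
import threading


class DistributedClockRecorder:
    """Record mode: every thread keeps its clock values in a private log."""

    def __init__(self, num_threads):
        self._clock = 0
        self._lock = threading.Lock()
        self.logs = [[] for _ in range(num_threads)]  # one log per thread

    def run(self, tid, action):
        # The lock stands in for an atomic fetch-and-add around the guarded region.
        with self._lock:
            ticket = self._clock
            self._clock += 1
            action()
        # The log write touches only this thread's list, outside the critical section.
        self.logs[tid].append(ticket)


class DistributedClockReplayer:
    """Replay mode: a thread proceeds only when the clock equals its next ticket."""

    def __init__(self, logs):
        self._clock = 0
        self._cond = threading.Condition()
        self._pending = [iter(log) for log in logs]

    def run(self, tid, action):
        ticket = next(self._pending[tid])
        with self._cond:
            self._cond.wait_for(lambda: self._clock == ticket)
            action()
            self._clock += 1
            self._cond.notify_all()


def execute(runner, num_threads, steps):
    """Run `steps` guarded appends per thread and return the observed order."""
    order = []

    def worker(tid):
        for _ in range(steps):
            runner.run(tid, lambda: order.append(tid))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


if __name__ == "__main__":
    recorder = DistributedClockRecorder(4)
    recorded = execute(recorder, 4, 50)
    replayed = execute(DistributedClockReplayer(recorder.logs), 4, 50)
    print("replay matches recording:", replayed == recorded)
```

In this sketch the only shared state during recording is the clock itself, and the ordering information is spread across the per-thread logs. The epoch-based variant is not modeled here.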
Why It Matters
Lower record-and-replay overhead dramatically speeds up debugging of the massively parallel applications that power scientific computing and AI training.