Improving MPI Error Detection and Repair with Large Language Models and Bug References
New technique combines RAG, Chain-of-Thought, and bug databases to fix HPC code.
A research team led by Scott Piersall has developed a breakthrough method for detecting and repairing bugs in Message Passing Interface (MPI) programs, which are critical for high-performance computing and distributed machine learning training. Their paper, "Improving MPI Error Detection and Repair with Large Language Models and Bug References," addresses a major pain point: while LLMs like ChatGPT show promise for automated code repair, they struggle with MPI's complex process synchronization and message passing patterns, achieving only 44% accuracy when applied directly.
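To make the failure mode concrete, below is a classic MPI synchronization bug of the kind such tools target. This example is illustrative only, not drawn from the paper's benchmark, and uses Python's mpi4py bindings for brevity (production HPC codes are more commonly C/C++):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank  # assumes exactly two ranks: run with `mpiexec -n 2 python demo.py`

# Buggy pattern: both ranks post a blocking receive before sending,
# so each waits forever on a message the other never sends.
#
#   data = comm.recv(source=peer)   # deadlock: neither rank ever reaches send()
#   comm.send(rank, dest=peer)

# One standard repair: break the symmetry so the calls pair up correctly.
if rank == 0:
    comm.send(rank, dest=peer)
    data = comm.recv(source=peer)
else:
    data = comm.recv(source=peer)
    comm.send(rank, dest=peer)

print(f"rank {rank} received {data}")
```

Bugs like this compile cleanly and depend entirely on cross-process ordering, which is why an LLM reasoning about one function at a time tends to miss them.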
The researchers' solution combines three techniques with LLMs. They use Retrieval-Augmented Generation (RAG) to retrieve relevant bug examples from databases of known MPI defects, Chain-of-Thought (CoT) prompting to guide the model through explicit reasoning steps, and Few-Shot Learning to supply in-context examples. This hybrid approach teaches the model not just syntax but the specific bug patterns and correct usage conventions found in real-world MPI programs. The result is a dramatic leap in performance, with error detection accuracy climbing from 44% to 77%, a 75% relative improvement over the baseline.
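The paper's exact prompt templates and retrieval pipeline are not reproduced in this summary, so the sketch below is a hypothetical illustration of how RAG-retrieved bug pairs, few-shot examples, and a chain-of-thought instruction might be composed into one repair prompt. The function name and the bug-record fields are invented for the example:

```python
def build_repair_prompt(buggy_code: str, retrieved_bugs: list[dict]) -> str:
    """Compose a repair prompt from retrieved bug pairs, few-shot examples,
    and a chain-of-thought instruction. Hypothetical sketch; not the
    paper's actual template."""
    # Few-shot context: (buggy, fixed) MPI pairs retrieved from a bug
    # database, e.g. by nearest-neighbor search over code embeddings.
    examples = "\n\n".join(
        f"### Known bug pattern: {bug['pattern']}\n"
        f"Buggy code:\n{bug['buggy']}\n"
        f"Corrected code:\n{bug['fixed']}"
        for bug in retrieved_bugs
    )
    # Chain-of-thought instruction: require explicit per-rank reasoning
    # before the model commits to a diagnosis or a fix.
    return (
        "You are repairing an MPI program.\n\n"
        f"{examples}\n\n"
        f"### Program under analysis:\n{buggy_code}\n\n"
        "Reason step by step: (1) list each rank's sends and receives in "
        "order, (2) check that every send has a matching receive, (3) match "
        "the program against the bug patterns above, (4) propose a minimal fix."
    )
```

The intuition behind this kind of composition is that the retrieved pairs supply MPI-specific domain knowledge the base model lacks, while the step-by-step instruction pushes it from surface pattern matching toward rank-by-rank analysis.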
Crucially, the team demonstrated that their bug-referencing technique generalizes well across different large language models, not just ChatGPT. This suggests the framework could be adapted to work with newer, more capable models as they emerge. The work represents a significant step toward reliable AI-assisted maintenance for the complex, parallel code that underpins large-scale simulations and the training of foundation AI models themselves, where a single synchronization error can waste thousands of GPU hours.
- Hybrid AI technique combines RAG, Chain-of-Thought, and bug databases to understand MPI error patterns
- Boosted ChatGPT's MPI error detection accuracy from 44% to 77%, a 75% relative improvement
- Method generalizes to other LLMs and automates repair for PyTorch/TensorFlow distributed training code
Why It Matters
Automates debugging of complex parallel code, saving time and preventing costly errors in large-scale AI training and scientific simulations.