Research & Papers

IOAgent: Democratizing Trustworthy HPC I/O Performance Diagnosis Capability via LLMs

The new AI tool analyzes complex storage system logs to find bottlenecks, freeing scientists from reliance on scarce human experts.

Deep Dive

A team of researchers from institutions including Lawrence Berkeley National Laboratory and the University of North Texas has introduced IOAgent, a novel AI system designed to automate the diagnosis of Input/Output (I/O) performance bottlenecks in High-Performance Computing (HPC) storage systems. As HPC storage stacks grow more complex, domain scientists face significant challenges in achieving optimal I/O performance and have traditionally relied on a scarce pool of human experts to analyze trace logs. IOAgent addresses this bottleneck by leveraging Large Language Models (LLMs) to provide trustworthy, automated diagnosis, bringing expert-level capability directly to scientists. The work was formally presented at the 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

The system tackles key LLM challenges, such as trace inputs that exceed context-window limits and model hallucination, by integrating a three-part architecture: a module-based pre-processor, a Retrieval-Augmented Generation (RAG) engine that incorporates accurate HPC I/O domain knowledge, and a tree-based merger for final diagnosis. It analyzes standard Darshan trace files and provides detailed, referenced justifications, along with an interactive interface for follow-up questions. To rigorously evaluate IOAgent, the team created and released TraceBench, the first open diagnosis test suite, comprising a diverse set of labeled job traces. Results show IOAgent performs on par with or better than state-of-the-art diagnosis tools and is model-agnostic, delivering consistent results with both proprietary models like GPT-4 and open-source alternatives. This marks a significant step toward scalable, expert-level support for data-intensive scientific computing.
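To make the three-stage flow concrete, the sketch below mimics the pipeline shape described above: split a parsed trace into per-module summaries, retrieve matching domain-knowledge snippets for each, and merge the per-module findings pairwise into one report. This is a minimal illustration only; the function names, symptom thresholds, and knowledge entries are invented assumptions, and the actual IOAgent system uses LLM calls where this sketch uses simple rules.

```python
# Illustrative sketch of a pre-process -> retrieve -> tree-merge pipeline.
# All names, thresholds, and knowledge entries are hypothetical, not IOAgent's.

KNOWLEDGE_BASE = [
    {"topic": "small_writes",
     "advice": "Many small writes suggest enabling collective buffering."},
    {"topic": "metadata",
     "advice": "High metadata time suggests reducing file open/close frequency."},
]

def preprocess(trace):
    """Split a parsed trace into per-module summaries (e.g., POSIX, MPI-IO)."""
    return [{"module": m, "metrics": v} for m, v in trace.items()]

def retrieve(summary, kb=KNOWLEDGE_BASE):
    """Pick knowledge entries whose topic matches a flagged symptom."""
    symptoms = []
    metrics = summary["metrics"]
    if metrics.get("avg_write_size", 1e9) < 4096:   # threshold is an assumption
        symptoms.append("small_writes")
    if metrics.get("meta_time_frac", 0.0) > 0.3:    # threshold is an assumption
        symptoms.append("metadata")
    return [e for e in kb if e["topic"] in symptoms]

def diagnose_module(summary):
    """Stand-in for one knowledge-grounded LLM call per module."""
    return {"module": summary["module"],
            "findings": [e["advice"] for e in retrieve(summary)]}

def tree_merge(diagnoses):
    """Pairwise-merge per-module diagnoses into a single report.
    (A stand-in for the merging step; here findings are concatenated.)"""
    while len(diagnoses) > 1:
        merged = []
        for i in range(0, len(diagnoses), 2):
            pair = diagnoses[i:i + 2]
            merged.append({
                "module": "+".join(d["module"] for d in pair),
                "findings": sum((d["findings"] for d in pair), []),
            })
        diagnoses = merged
    return diagnoses[0]

# Toy per-module metrics (Darshan parsing is out of scope for this sketch).
trace = {
    "POSIX":  {"avg_write_size": 512,     "meta_time_frac": 0.1},
    "MPI-IO": {"avg_write_size": 1 << 20, "meta_time_frac": 0.5},
}
report = tree_merge([diagnose_module(s) for s in preprocess(trace)])
print(report["module"])         # POSIX+MPI-IO
print(len(report["findings"]))  # 2
```

The pairwise merge keeps each step's input small, which is one plausible way a tree-based merger sidesteps context-window limits: no single call ever sees the full trace at once.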

Key Points
  • Uses a RAG-based architecture to integrate precise HPC I/O knowledge, reducing LLM hallucinations.
  • Evaluated on the new open-source TraceBench dataset, matching or beating existing diagnostic tools.
  • Model-agnostic design works effectively with both proprietary (e.g., GPT-4) and open-source LLMs.

Why It Matters

Democratizes access to high-level HPC performance tuning, accelerating scientific discovery by reducing dependency on scarce human experts.