CARAT: Client-Side Adaptive RPC and Cache Co-Tuning for Parallel File Systems
New machine learning framework adapts to I/O patterns in real time, achieving up to 3x speedups over default configurations.
A research team from Texas Tech University and Oak Ridge National Laboratory has introduced CARAT (Client-Side Adaptive RPC and Cache Co-Tuning), a machine learning framework designed to optimize parallel file systems in high-performance computing environments. The system addresses longstanding challenges in HPC I/O performance by enabling real-time, client-side tuning of critical parameters without requiring global coordination or pattern-dependent configurations. Unlike previous autotuning approaches that lacked scalability and online operation capabilities, CARAT allows each compute node to independently adapt to dynamic I/O patterns and system conditions, responding to changes in application behavior and network states as they occur.
The framework relies only on locally observable metrics to make its tuning decisions, and it tunes RPC (remote procedure call) and caching parameters together rather than in isolation. The researchers prototyped CARAT on the Lustre parallel file system and evaluated it across diverse I/O patterns, real-world HPC workloads, and multi-client deployments. The results showed performance improvements of up to 3x over default or statically configured systems, supporting the approach's effectiveness and generality. Because CARAT is lightweight and scalable, it is suitable for deployment in existing HPC infrastructures, where it could benefit data-intensive applications ranging from scientific simulations to AI training workloads. The research will be presented at the 40th IEEE International Parallel & Distributed Processing Symposium in 2026.
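To make the idea of client-side co-tuning concrete, here is a minimal sketch of a single node searching over an RPC parameter and a cache parameter at the same time, guided only by a throughput figure it measures locally. This is not CARAT's method: the article describes a machine learning framework, whereas this sketch substitutes a plain hill-climbing search for illustration. The parameter names `rpcs_in_flight` and `readahead_mb`, their candidate values, and the `simulated_throughput` function are all hypothetical stand-ins, not values taken from the paper or from Lustre.

```python
from typing import Callable, Dict, List

Config = Dict[str, int]

# Hypothetical search space: one RPC-side knob and one cache-side knob.
# Names and values are illustrative assumptions, not CARAT's parameter set.
SEARCH_SPACE: Dict[str, List[int]] = {
    "rpcs_in_flight": [4, 8, 16, 32],
    "readahead_mb": [16, 64, 256],
}


def neighbors(cfg: Config) -> List[Config]:
    """Return configs that differ from cfg by one step in exactly one parameter."""
    out = []
    for name, values in SEARCH_SPACE.items():
        i = values.index(cfg[name])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                out.append({**cfg, name: values[j]})
    return out


def tune(cfg: Config, measure: Callable[[Config], float], max_rounds: int = 10) -> Config:
    """Hill-climb over both knobs using only a locally measured score."""
    best, best_score = cfg, measure(cfg)
    for _ in range(max_rounds):
        scored = [(measure(c), c) for c in neighbors(best)]
        top_score, top = max(scored, key=lambda t: t[0])
        if top_score <= best_score:
            break  # no neighbor improves on the current config
        best, best_score = top, top_score
    return best


def simulated_throughput(cfg: Config) -> float:
    """Toy stand-in for a local throughput measurement (MB/s).

    Peaks at rpcs_in_flight=16 and readahead_mb=64 by construction.
    """
    return (1000.0
            - 10.0 * abs(cfg["rpcs_in_flight"] - 16)
            - 1.0 * abs(cfg["readahead_mb"] - 64))


if __name__ == "__main__":
    start = {"rpcs_in_flight": 4, "readahead_mb": 16}
    print(tune(start, simulated_throughput))
```

On a real client, `measure` would apply a candidate configuration and observe throughput over a short window instead of calling a toy function. Because each node only consults its own measurements, no coordination with other clients or with the servers is needed, which is the property the article highlights.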
- Achieves up to 3x performance improvement over default configurations in parallel file systems
- Enables independent, client-side tuning using only locally observable metrics without global coordination
- Prototyped on the Lustre parallel file system and evaluated with real-world HPC workloads and multi-client deployments
Why It Matters
Real-time, client-side I/O tuning could substantially accelerate data-intensive HPC and AI workloads by reducing I/O bottlenecks, without requiring changes to global system configuration.