AdapTBF: Decentralized Bandwidth Control via Adaptive Token Borrowing for HPC Storage
New decentralized system prevents small jobs from hogging storage bandwidth, improving overall I/O efficiency by up to 30%.
Researchers Md Hasanur Rashid and Dong Dai have published a paper introducing AdapTBF, a decentralized bandwidth control system designed for high-performance computing (HPC) storage environments. It targets a common problem in modern HPC infrastructure: applications running on separate compute resources share a global storage system, and bandwidth on that system is often allocated inefficiently. Small jobs with bursty I/O patterns can consume a disproportionate share of storage bandwidth, starving larger jobs that have been allocated many compute nodes and wasting those resources. AdapTBF builds on the existing Token Bucket Filter (TBF) implementations in parallel file systems such as Lustre, and adds adaptive borrowing and lending mechanisms to overcome the limitations of strict proportional bandwidth limits.
AdapTBF's key idea is a decentralized scheme in which an application can temporarily borrow unused bandwidth tokens from other applications during its bursty phases and return them once it is idle. This adaptive mechanism aims to maximize both per-application performance and overall storage efficiency while keeping allocation fair across jobs of different scales. The researchers implemented AdapTBF in Lustre and evaluated it with synthetic workloads modeled after real-world HPC scenarios, reporting effective I/O bandwidth management even under extreme conditions. Compared with traditional static allocation, the approach could reduce resource waste in large-scale scientific computing environments, where storage bottlenecks can delay research timelines and lower computational efficiency.
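The borrow-and-return idea can be illustrated with a small Python sketch. This is not the authors' Lustre implementation: the `Bucket` class, the `borrow` and `repay` functions, and the token counts are all hypothetical, and the ledger is simplified to a single borrower and lender pair with no refill timer.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """Hypothetical per-application token bucket (one token = one I/O request)."""
    name: str
    tokens: int          # tokens currently available to spend
    lent: int = 0        # tokens lent out and still owed to this application
    borrowed: int = 0    # tokens borrowed and still owed by this application


def borrow(borrower: Bucket, lender: Bucket, wanted: int) -> int:
    """Move up to `wanted` unused tokens from lender to borrower; return the amount moved."""
    amount = min(wanted, lender.tokens)
    lender.tokens -= amount
    lender.lent += amount
    borrower.tokens += amount
    borrower.borrowed += amount
    return amount


def repay(borrower: Bucket, lender: Bucket) -> int:
    """Return as many owed tokens as the borrower can currently spare."""
    amount = min(borrower.borrowed, borrower.tokens, lender.lent)
    borrower.tokens -= amount
    borrower.borrowed -= amount
    lender.tokens += amount
    lender.lent -= amount
    return amount


# Assumed scenario: a small job bursts while a large job is idle.
small = Bucket("small_job", tokens=10)
large = Bucket("large_job", tokens=100)

borrow(small, large, wanted=30)   # small now holds 40 tokens, large holds 70
small.tokens -= 40                # the burst consumes every token small holds
small.tokens += 10                # small goes idle and its own refill arrives
repay(small, large)               # small returns 10 tokens and still owes 20
```

Tracking `lent` and `borrowed` separately is one way to keep the exchange fair: the lender can always see how much it is owed, so a burst by one job never permanently shrinks another job's share.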
- Prevents small jobs from blocking large-scale HPC applications by allocating I/O bandwidth in proportion to each job's compute resources
- Introduces an adaptive borrowing/lending mechanism that improves upon traditional Token Bucket Filter implementations in systems like Lustre
- Maintains high storage utilization (up to 30% improvement) while ensuring fairness across different job scales and bursty I/O patterns
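As a rough illustration of allocating bandwidth in proportion to compute resources, the Python sketch below splits a storage server's token rate by node count. The function name `proportional_rates`, the job names, and the figures are assumptions made for the example and do not come from the paper.

```python
def proportional_rates(nodes_per_job: dict[str, int], total_rate: int) -> dict[str, int]:
    """Split `total_rate` (tokens per second) across jobs by their node counts.

    Integer division is used, so a few leftover tokens may stay unallocated.
    """
    total_nodes = sum(nodes_per_job.values())
    if total_nodes <= 0:
        raise ValueError("at least one job must hold a compute node")
    return {job: total_rate * nodes // total_nodes
            for job, nodes in nodes_per_job.items()}


# Assumed example: a 4-node job and a 60-node job share 6400 tokens per second.
rates = proportional_rates({"small_job": 4, "large_job": 60}, total_rate=6400)
print(rates)  # {'small_job': 400, 'large_job': 6000}
```

A strict split like this is the baseline that borrowing and lending relax: the small job's 400 tokens per second cap its bursts even while the large job's 6000 sit unused.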
Why It Matters
Reduces resource waste in scientific computing by optimizing storage bandwidth allocation, potentially accelerating large-scale research projects.