GriNNder enables single-GPU full-graph GNN training with up to 9.78x speedup

arXiv cs.DC May 13, 2026

⚡ Offload GNN training to SSDs and match distributed-system throughput on a single GPU

Deep Dive

Full-graph training of graph neural networks (GNNs) preserves complete neighborhood information but typically requires multiple GPUs or servers, incurring high hardware and communication costs. Existing single-server methods remain bottlenecked by GPU and host memory capacity as graph sizes grow. GriNNder is presented as the first work to break this memory wall by offloading training data to NVMe SSDs, which offer multi-terabyte capacities and bandwidths exceeding 10 GB/s. Its key innovation is structured storage offloading (SSO), a framework that manages the GPU-host-storage hierarchy through coordinated cache, regather, and bypass mechanisms.
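To make the memory wall concrete, here is a back-of-envelope sketch in Python. It is not taken from the paper: the tier capacities, the workload size, and the function names `activation_bytes` and `smallest_fitting_tier` are all hypothetical, and the estimate counts only one fp32 hidden-feature matrix per layer for every node, ignoring the graph structure, weights, and gradients.

```python
GIB = 1024 ** 3

def activation_bytes(num_nodes, hidden_dim, num_layers, bytes_per_value=4):
    """Bytes needed to keep one hidden-feature matrix per layer for every node."""
    return num_nodes * hidden_dim * num_layers * bytes_per_value

def smallest_fitting_tier(required_bytes, tiers):
    """Return the first tier (ordered fastest first) that can hold the data, or None."""
    for name, capacity in tiers:
        if required_bytes <= capacity:
            return name
    return None

# Hypothetical capacities for one machine: GPU memory, host DRAM, NVMe SSD.
TIERS = [("gpu", 24 * GIB), ("host", 256 * GIB), ("nvme", 4096 * GIB)]

# Hypothetical workload: 100 million nodes, 256 hidden units, 3 layers, fp32.
need = activation_bytes(100_000_000, 256, 3)
tier = smallest_fitting_tier(need, TIERS)

print(f"activations: {need / GIB:.1f} GiB, smallest fitting tier: {tier}")
```

Under these assumed numbers the activations alone exceed both GPU and host memory, so only the storage tier can hold them, which is the situation storage offloading targets.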

GriNNder implements three core techniques: (i) a partition-wise caching strategy for host memory that exploits cross-partition dependencies, (ii) a regathering strategy for gradient computation that eliminates redundant storage operations, and (iii) a lightweight partitioning scheme that reduces the memory footprint of existing graph partitioners. In experiments across various models and datasets, GriNNder achieves up to 9.78x speedup over state-of-the-art baselines and delivers throughput comparable to distributed systems. This makes previously infeasible large-scale full-graph training possible on a single GPU. The paper has been accepted to MLSys 2026.
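The idea behind caching whole partitions in host memory can be illustrated with a toy sketch. This is not GriNNder's implementation: the `PartitionCache` class, the `run_epoch` helper, the least-recently-used eviction policy, and the four-partition dependency map are all assumptions made for illustration, and a plain dictionary stands in for NVMe-resident data.

```python
from collections import OrderedDict

class PartitionCache:
    """Host-memory cache that holds whole partitions and evicts the least recently used."""

    def __init__(self, capacity, storage):
        self.capacity = capacity      # max number of partitions kept in host memory
        self.storage = storage        # dict standing in for data stored on NVMe
        self.cache = OrderedDict()    # partition id -> data, least recently used first
        self.storage_reads = 0        # number of simulated reads from storage

    def get(self, pid):
        if pid in self.cache:
            self.cache.move_to_end(pid)      # hit: mark as most recently used
            return self.cache[pid]
        self.storage_reads += 1              # miss: fetch from (simulated) storage
        data = self.storage[pid]
        self.cache[pid] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used partition
        return data

def run_epoch(cache, deps):
    """Visit each partition and fetch it along with the partitions it depends on."""
    for pid, neighbors in deps.items():
        for needed in [pid, *neighbors]:
            cache.get(needed)
    return cache.storage_reads

# Hypothetical setup: 4 partitions in a chain, each depending on its neighbors.
storage = {p: [p] * 4 for p in range(4)}
deps = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

reads_with_cache = run_epoch(PartitionCache(capacity=2, storage=storage), deps)
reads_without_cache = sum(1 + len(n) for n in deps.values())
print(reads_with_cache, reads_without_cache)
```

Because neighboring partitions share dependencies, even a cache holding two partitions serves some requests from host memory and issues fewer storage reads than fetching every dependency from storage each time.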

Key Points

Achieves up to 9.78x speedup over state-of-the-art baselines and throughput comparable to distributed systems
First work to leverage NVMe SSDs for full-graph GNN training via structured storage offloading (SSO)
Enables large-scale full-graph training on a single GPU with partition-wise caching and gradient regathering

Why It Matters

GriNNder lowers the barrier to large-scale full-graph GNN training: a single GPU backed by NVMe storage can take the place of a multi-GPU or multi-server setup, cutting hardware costs and communication overhead.
