SAKURAONE: An Open Ethernet-Based AI HPC System and Its Observed Workload Dynamics in a Single-Tenant LLM Development Environment
A 100-node H100 cluster proves open Ethernet can scale to elite AI supercomputing levels.
SAKURA Internet Research Center has unveiled SAKURAONE, a high-performance computing (HPC) cluster that demonstrates the viability of open, Ethernet-based networking for elite AI workloads. The system, built on the company's KOKARYOKU PHY bare-metal GPU platform, achieved a #49 ranking on the ISC 2025 TOP500 list with a measured performance of 33.95 PFLOP/s. Its defining feature is a fully open networking stack: 800 Gigabit Ethernet (GbE) running the SONiC network operating system, making it the only system in the TOP500's top 100 to forgo proprietary interconnects such as InfiniBand. The hardware comprises 100 nodes, each equipped with eight NVIDIA H100 GPUs, interconnected via a rail-optimized leaf-spine fabric that uses RoCEv2 for efficient data transfer, and backed by a 2 petabyte all-flash Lustre file system.
Beyond its specifications, the paper provides a rare, detailed look at real-world AI cluster usage by tracking a single-tenant LLM development project. The researchers observed that while small, interactive jobs (such as debugging and testing) were the most numerous, a handful of massive training jobs consumed the majority of total GPU time. Crucially, as the project moved from initial training to refinement, resource usage shifted from these large-scale jobs toward a dominance of mid-scale jobs, reflecting an iterative development cycle of tuning and evaluation. This workload analysis offers data-driven insight for planning and optimizing future AI research infrastructure, showing how compute demands evolve across an LLM's lifecycle.
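The skew described above — job counts dominated by small jobs, GPU time dominated by a few large ones — can be sketched with a toy GPU-hours calculation. The job mix below is invented purely for illustration; it is not the paper's measured data:

```python
# Hypothetical job log: (gpu_count, wall_hours) per job.
# Counts and sizes are made up to illustrate the skew, not taken from the paper.
jobs = (
    [(1, 0.5)] * 80      # many small interactive/debugging jobs
    + [(32, 12.0)] * 15  # mid-scale tuning/evaluation jobs
    + [(256, 72.0)] * 5  # a few large-scale training jobs
)

def gpu_hours(job):
    """GPU-hours consumed by one job: GPUs used x wall-clock hours."""
    gpus, hours = job
    return gpus * hours

total = sum(gpu_hours(j) for j in jobs)
large_share = sum(gpu_hours(j) for j in jobs if j[0] >= 256) / total

# Large jobs are 5% of job count yet dominate aggregate GPU time.
print(f"large jobs: {5 / len(jobs):.0%} of jobs, "
      f"{large_share:.0%} of GPU-hours")
```

Under this made-up mix, five large jobs (5% of submissions) account for roughly 94% of GPU-hours, which is the shape of imbalance the paper reports; tracking how that share migrates toward the mid-scale tier over time is what reveals the project's shift into its refinement phase.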
- Ranked 49th on TOP500 using 800 GbE/SONiC, proving open networking scales to 33.95 PFLOP/s.
- Hardware includes 800 NVIDIA H100 GPUs across 100 nodes with a 2 PB all-flash storage system.
- Real workload analysis showed a shift from large-scale training to mid-scale iterative jobs as an LLM project matured.
Why It Matters
It validates open Ethernet as a cost-effective, scalable alternative to proprietary interconnects for cutting-edge AI supercomputing.