Research & Papers

New replication scheme shields supercomputers from silent data corruption

Detects and corrects corrupted tasks by replaying only the affected parts.

Deep Dive

Silent data corruption (SDC), in which hardware faults alter computation results without raising any error, is a growing threat as supercomputer clusters scale up. Traditional replication (running everything twice and comparing the results) is hard to apply in modern asynchronous many-task (AMT) runtimes because tasks are spawned dynamically and moved between workers by work stealing. Researchers Mia Reitz and Claudia Fohry at the University of Kassel propose a replication scheme that handles SDC efficiently for nested fork-join programs. Their approach replicates the entire computation but records only the task tree structure. When the final results of the replicas do not match, the system traverses the tree top-down to pinpoint exactly which tasks are corrupted and need recomputation. Only those tasks are re-executed, reusing results from uncorrupted child tasks. The team implemented the scheme in the Itoyori AMT runtime and found that the time to identify and reprocess affected tasks is negligible. The paper also discusses adapting the scheme to tasks that communicate through futures, extending its applicability beyond fork-join patterns. The work targets the reliability challenges of exascale computing without having to rerun the whole program when a mismatch is detected.
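The top-down search described above can be illustrated with a small toy model. This is a minimal sketch and not the authors' implementation or the Itoyori API. It assumes that each recorded tree node also keeps its task's result (the summary does not spell out what is stored), that both replicas produce trees of identical shape, and that re-execution is fault-free. The names `TaskNode`, `repair`, `rerun`, `record_run` and `LEAF_INPUT` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class TaskNode:
    """One recorded task: its result and the child tasks it forked."""
    name: str
    result: int
    children: List["TaskNode"] = field(default_factory=list)


def repair(a: TaskNode, b: TaskNode,
           rerun: Callable[[str, List[int]], int],
           redone: List[str]) -> int:
    """Return a trusted result for the task recorded as `a` and `b`.

    Subtrees whose results agree in both replicas are reused as-is.
    On a mismatch, the children are repaired first and then the task's
    own work is re-executed on the trusted child results.
    """
    if a.result == b.result:
        return a.result  # replicas agree: reuse, do not descend
    trusted = [repair(ca, cb, rerun, redone)
               for ca, cb in zip(a.children, b.children)]
    redone.append(a.name)  # log which tasks had to be re-executed
    return rerun(a.name, trusted)


# Toy fork-join program (assumption): leaves hold inputs, inner tasks sum.
LEAF_INPUT: Dict[str, int] = {"l1": 1, "l2": 2, "r1": 3, "r2": 4}


def rerun(name: str, child_results: List[int]) -> int:
    """Re-execute one task: a leaf reloads its input, a join sums."""
    if name in LEAF_INPUT:
        return LEAF_INPUT[name]
    return sum(child_results)


def record_run(corrupt: Optional[str] = None) -> TaskNode:
    """Simulate one replica; optionally corrupt the result of one leaf."""
    def leaf(n: str) -> TaskNode:
        return TaskNode(n, 99 if n == corrupt else LEAF_INPUT[n])

    def join(n: str, kids: List[TaskNode]) -> TaskNode:
        return TaskNode(n, sum(k.result for k in kids), kids)

    return join("root", [join("l", [leaf("l1"), leaf("l2")]),
                         join("r", [leaf("r1"), leaf("r2")])])


replica_a = record_run()              # fault-free replica
replica_b = record_run(corrupt="r1")  # replica hit by a silent error
redone: List[str] = []
fixed = repair(replica_a, replica_b, rerun, redone)
print(fixed, redone)  # prints: 10 ['r1', 'r', 'root']
```

In this toy run the left subtree is never re-executed because both replicas agree on its result. Only the corrupted leaf and the two joins above it, whose results depended on it, are redone.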

Key Points
  • Uses a recorded task tree to identify and recompute only corrupted tasks, not the entire program.
  • Implemented in the Itoyori AMT runtime; identification and reprocessing overhead is negligible.
  • Extensible to tasks using futures, broadening applicability beyond nested fork-join programs.

Why It Matters

Makes exascale supercomputing more reliable by locating and repairing silent errors at negligible extra cost on top of replication.