Developer Tools

PyTorch fixes D-state hang in torchrun elastic agent

Workers stuck in kernel D-state no longer freeze your training job indefinitely.

Deep Dive

PyTorch's torchrun elastic agent had a dangerous bug: when training workers entered D-state (uninterruptible sleep, common during NCCL/GPU/RDMA operations), the agent would hang forever. The root cause was that `MultiprocessContext._close` and `SubprocessContext._close` called `proc.join()` and `proc.wait()` with no timeout. A process in D-state does not act on any signal, SIGKILL included, until it leaves that state, so these calls never returned. The launcher stayed wedged and its GPU slots stayed pinned until an operator killed the container by hand, wasting time and resources.
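For readers who want to see what "D-state" looks like from user space, here is a minimal sketch that reads a process's state letter from Linux procfs. It is an illustration only, not how the torchrun agent detects D-state; the helper names `parse_proc_state`, `process_state`, and `is_d_state` are hypothetical.

```python
from pathlib import Path
from typing import Optional


def parse_proc_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat line."""
    # The second field (comm) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the LAST ')' before tokenizing.
    _, _, rest = stat_line.rpartition(")")
    return rest.split()[0]


def process_state(pid: int) -> Optional[str]:
    """Return the state letter for `pid`, or None if it cannot be read."""
    try:
        return parse_proc_state(Path(f"/proc/{pid}/stat").read_text())
    except (FileNotFoundError, ProcessLookupError):
        # Process is gone, or procfs is not available on this platform.
        return None


def is_d_state(pid: int) -> bool:
    """True if the kernel reports `pid` in uninterruptible sleep ('D')."""
    return process_state(pid) == "D"
```

A single "D" reading is not alarming by itself, since processes pass through uninterruptible sleep briefly during ordinary I/O. The problem case is a worker that stays there.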

Claude-authored patch #185414 fixes this with two changes. First, `proc.join()` and `proc.wait()` now honor the same timeout (default 1 second) used elsewhere in `_close`. If the timeout expires, the agent logs the unkillable PID with a message that names the SIGKILL signal and advises recycling the host, then continues its exit. Second, D-state detection (added in a prior change) now actually escalates: when `_check_d_state_timeout` fires, `_remaining_restarts` is set to 0 before UNHEALTHY is returned, which forces `_invoke_run` to exit immediately rather than retry. New unit tests (`BoundedCloseTest`, `LocalElasticAgentDStateTest`) confirm that the agent exits within 5 seconds instead of hanging. One caveat: the host must still be recycled to free GPU/NIC resources, but the training job itself can now terminate cleanly.
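The two ideas can be sketched in a few lines of standalone Python. This is a toy model under stated assumptions, not the patch itself: `bounded_close` and `run_agent` are hypothetical names, the state strings are plain placeholders, and the real logic lives in the PyTorch methods named above. The only library behavior relied on is that `subprocess.Popen.wait(timeout=...)` raises `subprocess.TimeoutExpired`.

```python
import logging
import subprocess

logger = logging.getLogger("toy_agent")


def bounded_close(proc, timeout=1.0):
    """SIGKILL `proc`, then wait at most `timeout` seconds for it to exit.

    Returns True if it exited, False if it still had not (e.g. D-state).
    """
    proc.kill()  # Popen.kill() sends SIGKILL on POSIX
    try:
        proc.wait(timeout=timeout)  # bounded, unlike a bare proc.wait()
        return True
    except subprocess.TimeoutExpired:
        logger.warning(
            "pid %s did not exit within %.1fs of SIGKILL; it may be in "
            "D-state. Consider recycling the host.", proc.pid, timeout)
        return False  # give up on this pid and let the caller keep exiting


def run_agent(monitor, max_restarts):
    """Toy supervisor loop; `monitor()` returns (state, d_state_timed_out).

    Returns (final_state, restarts_used).
    """
    remaining_restarts = max_restarts
    restarts_used = 0
    while True:
        state, d_state_timed_out = monitor()
        if d_state_timed_out:
            # Escalate: a restart cannot help a worker stuck in the kernel,
            # so drop the restart budget before reporting UNHEALTHY.
            remaining_restarts = 0
            state = "UNHEALTHY"
        if state == "SUCCEEDED":
            return state, restarts_used
        if state == "UNHEALTHY":
            if remaining_restarts == 0:
                return "FAILED", restarts_used
            remaining_restarts -= 1
            restarts_used += 1
        # Any other state: keep polling (a real agent would sleep here).
```

With a monitor that reports a D-state timeout, `run_agent` returns after one pass without using any restarts, even if `max_restarts` is greater than zero. An ordinary UNHEALTHY result still consumes a restart as before.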

Key Points
  • Workers stuck in D-state (uninterruptible sleep) caused the torchrun agent to hang on unbounded process joins.
  • The fix applies a timeout (default 1 second) to `proc.join()` and `proc.wait()`; on timeout the agent logs the unkillable PID and continues its exit.
  • D-state detection now sets `_remaining_restarts` to 0 before returning UNHEALTHY, forcing an immediate exit instead of a retry.

Why It Matters

Distributed training jobs no longer stall indefinitely on unkillable worker processes, which saves operator time and compute. The affected host still needs to be recycled to release its GPU/NIC resources.