Research & Papers

Varuna: Enabling Failure-Type Aware RDMA Failover

New system eliminates redundant retransmissions, preserving consistency for non-idempotent operations during network failures.

Deep Dive

A research team led by Xiaoyang Wang has introduced Varuna, a novel RDMA (Remote Direct Memory Access) failover system that fundamentally changes how data centers handle network link failures. Current production systems use a brute-force approach: when a primary RDMA link fails, they retransmit all in-flight requests over a standby backup path. Varuna's key insight is that this "blanket retransmission" is both inefficient and potentially incorrect. It unnecessarily consumes bandwidth by resending requests the responder has already executed (post-failure requests), and for non-idempotent operations—where repeating an action changes the result—duplicate execution can violate application semantics and break transactional consistency.
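A toy model makes the non-idempotency problem concrete. This is illustrative code, not Varuna's implementation: it shows how blindly replaying a fetch-and-add that the responder already executed corrupts state, which is exactly what blanket retransmission risks.

```python
# Toy model (not Varuna's code): why blanket retransmission breaks
# non-idempotent operations. A fetch-and-add is non-idempotent, so
# replaying a request the responder already executed applies it twice.

class Responder:
    def __init__(self):
        self.counter = 0

    def fetch_and_add(self, delta):
        old = self.counter
        self.counter += delta
        return old

responder = Responder()

# The requester issues a fetch-and-add; the responder executes it, but
# the completion is lost when the primary link fails (a "post-failure"
# request in Varuna's terminology).
responder.fetch_and_add(5)   # executed, but the ack never arrives

# Blanket retransmission over the backup path replays the same request.
responder.fetch_and_add(5)   # duplicate execution

print(responder.counter)     # 10, but the application expects 5
```

An idempotent operation (say, a plain RDMA read) would survive this replay unchanged; it is precisely the read-modify-write operations that make blanket retransmission unsafe.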

Varuna solves this by piggybacking a lightweight completion log on every RDMA operation. When a link failure occurs, this log allows the system to deterministically identify which in-flight requests were truly lost (pre-failure) and which were successfully executed by the responder (post-failure). The system then performs a surgical recovery: it retransmits only the pre-failure subset and fetches the return values for the already-completed post-failure requests. Evaluated on synthetic microbenchmarks and end-to-end RDMA TPC-C transactions (a standard database benchmark), Varuna delivered strong results: only 0.6-10% steady-state latency overhead in realistic applications, a 65% reduction in recovery retransmission time, guaranteed transactional consistency, and zero connectivity rebuild overhead with negligible extra memory use during failover.
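The recovery step above can be sketched as a simple classification over in-flight requests. This is a minimal sketch under stated assumptions, not Varuna's API: the names `CompletionLog` and `recover` are hypothetical, and the log is modeled as a plain dictionary of executed request IDs and their return values.

```python
# Minimal sketch (illustrative names, not Varuna's API) of
# completion-log-driven failover: resend only lost requests, and fetch
# cached return values for requests the responder already executed.

class CompletionLog:
    """Responder-side record of executed requests and their results."""
    def __init__(self):
        self.executed = {}           # request_id -> return value

    def record(self, req_id, ret):
        self.executed[req_id] = ret

def recover(in_flight, log, retransmit, fetch_result):
    """Classify in-flight requests after a link failure.

    Pre-failure requests (absent from the log) were never executed and
    are retransmitted on the backup path. Post-failure requests (present
    in the log) only need their return values fetched, never re-executed,
    which preserves non-idempotent semantics.
    """
    results = {}
    for req_id, request in in_flight.items():
        if req_id in log.executed:
            results[req_id] = fetch_result(req_id)   # completed: no resend
        else:
            results[req_id] = retransmit(request)    # lost: safe to resend
    return results

# Toy usage: requests 1 and 2 were in flight; only request 1 executed.
log = CompletionLog()
log.record(1, "old=0")

out = recover(
    in_flight={1: "faa(+5)", 2: "faa(+3)"},
    log=log,
    retransmit=lambda req: f"resent {req}",
    fetch_result=lambda rid: log.executed[rid],
)
print(out)   # {1: 'old=0', 2: 'resent faa(+3)'}
```

The design point the sketch captures is that correctness hinges on the log being deterministic at failover time, which is why Varuna piggybacks it on every operation rather than reconstructing state after the fact.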

Key Points
  • Cuts recovery retransmission time by 65% by avoiding redundant resends of already-executed requests.
  • Preserves transactional consistency for non-idempotent operations, preventing semantic violations from duplicate execution.
  • Adds only 0.6-10% steady-state latency overhead and zero connectivity rebuild overhead with lightweight completion logs.

Why It Matters

Enables faster, more reliable cloud and AI infrastructure by making critical network failover both smarter and more correct.