Research & Papers

CD-Raft: Reducing the Latency of Distributed Consensus in Cross-Domain Sites

New distributed protocol reduces tail latency by 49% for AI workloads across multiple sites.

Deep Dive

A team of researchers has published a paper on CD-Raft, a novel optimization of the widely-used Raft consensus protocol specifically designed for the high-stakes environment of cross-domain AI computation. As massive AI models require heavy data synchronization across geographically distributed data centers (sites), the latency of achieving consensus on data state becomes a major bottleneck. CD-Raft tackles this by intelligently optimizing the round-trip time (RTT) for read and write operations across domains and strategically positioning the leader node within the network to minimize communication delays.
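The paper does not publish its placement algorithm in this summary, but the core idea of leader positioning can be sketched. The snippet below is an illustrative assumption, not CD-Raft's actual procedure: since a Raft write commits once a majority of sites acknowledges it, a good leader site is one that minimizes the RTT to its fastest majority of followers.

```python
def quorum_commit_rtt(rtts_from_leader):
    """RTT needed to commit with a majority quorum: the leader counts
    itself, so it waits for the (n // 2)-th fastest follower ack."""
    n = len(rtts_from_leader) + 1          # total sites incl. leader
    needed_acks = n // 2                   # follower acks for majority
    return sorted(rtts_from_leader)[needed_acks - 1]

def best_leader(rtt_matrix):
    """Pick the site whose quorum commit RTT is smallest.
    rtt_matrix[i][j] = measured RTT between sites i and j (ms)."""
    best, best_rtt = None, float("inf")
    for leader in range(len(rtt_matrix)):
        followers = [rtt_matrix[leader][j]
                     for j in range(len(rtt_matrix)) if j != leader]
        rtt = quorum_commit_rtt(followers)
        if rtt < best_rtt:
            best, best_rtt = leader, rtt
    return best, best_rtt

# Hypothetical 3-site deployment with cross-domain RTTs in ms
rtts = [
    [0, 30, 120],
    [30, 0, 100],
    [120, 100, 0],
]
leader, commit_rtt = best_leader(rtts)
```

In this toy topology, either of the two nearby sites makes a far better leader than the distant one, because a quorum never has to wait on the slowest link.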

The team formally verified CD-Raft's correctness using the TLA+ specification language, guaranteeing it maintains the strong consistency required for reliable distributed systems. They built a prototype and evaluated its performance using the standard YCSB (Yahoo! Cloud Serving Benchmark) suite with traces simulating real-world workloads. The empirical results are significant: compared to the classic Raft protocol, CD-Raft achieved a 32.90% reduction in average latency and a dramatic 49.24% cut in 99th percentile tail latency. This directly translates to faster and more predictable synchronization for distributed AI training jobs spanning multiple clouds or regions.
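To make the two reported metrics concrete, here is a minimal sketch of how average and 99th percentile (tail) latency are typically computed from a benchmark trace. The traces and numbers below are made up for illustration; they are not the paper's data.

```python
import math

def p99(latencies):
    """Nearest-rank 99th percentile: the ceil(0.99 * n)-th smallest sample."""
    ordered = sorted(latencies)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

def reduction_pct(baseline, optimized):
    """Percentage reduction of 'optimized' relative to 'baseline'."""
    return 100 * (baseline - optimized) / baseline

# Hypothetical traces: 99 fast requests plus one slow straggler each (ms)
raft_trace = [10.0] * 99 + [200.0]
cdraft_trace = [7.0] * 99 + [100.0]

avg_cut = reduction_pct(sum(raft_trace) / len(raft_trace),
                        sum(cdraft_trace) / len(cdraft_trace))
tail_cut = reduction_pct(p99(raft_trace), p99(cdraft_trace))
```

The point of reporting p99 alongside the average is that a single straggler barely moves the mean but dominates worst-case behavior, which is exactly what tail-latency reductions like CD-Raft's 49.24% target.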

Key Points
  • CD-Raft is an optimized Raft protocol that reduces average consensus latency by 32.9% for cross-domain AI workloads.
  • It slashes the critical 99th percentile tail latency (the worst-case delays) by 49.24%, improving system predictability.
  • The protocol's strong-consistency guarantees are formally verified with a TLA+ specification, and its performance gains are validated with YCSB benchmark traces.

Why It Matters

Faster data consensus directly accelerates distributed AI training and inference, reducing costs and time-to-market for large models.