Research & Papers

Fast Log-Domain Sinkhorn Optimal Transport with Warp-Level GPU Reductions

New CUDA kernel handles 8K×8K matrices in 256 MB at epsilon=1e-4

Deep Dive

FastSinkhorn, developed by Hao Xiao and published on arXiv (2605.00837), tackles a foundational computational bottleneck in machine learning: solving optimal transport (OT) problems at scale. The traditional Sinkhorn algorithm is widely used for tasks like color transfer and point cloud alignment, but existing implementations (e.g., the Python Optimal Transport library) are either numerically unstable at low regularization or burdened by deep learning framework overhead. Xiao's solution is a lightweight native CUDA kernel that leverages warp-level shuffle reductions and shared-memory tiling to maximize GPU utilization while staying entirely in the log domain. This allows FastSinkhorn to handle regularization parameters as small as epsilon = 10⁻⁴, a regime where standard-domain methods often fail due to underflow or overflow.
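
To make the kernel-level idea concrete, here is a minimal CUDA sketch of the warp-level log-sum-exp primitive that this description implies; it illustrates the general technique, not FastSinkhorn's actual code. Log-domain Sinkhorn replaces the multiplicative update u = a / (Kv), where K = exp(-C/epsilon) underflows at small epsilon, with the additive update f_i = epsilon·log(a_i) - epsilon·logsumexp_j((g_j - C_ij)/epsilon), so the inner primitive is a row-wise logsumexp. The kernel name, the one-warp-per-row mapping, and the launch shape below are illustrative assumptions; a production kernel would additionally stage tiles of C through shared memory, as the paper's description indicates.

#include <cfloat>   // FLT_MAX
#include <math.h>   // expf, logf, fmaxf

// Butterfly max across a full warp: after the loop, every lane holds the max.
__device__ float warp_max(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v = fmaxf(v, __shfl_xor_sync(0xffffffffu, v, offset));
    return v;
}

// Butterfly sum across a full warp.
__device__ float warp_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffffu, v, offset);
    return v;
}

// One warp per row: out[row] = logsumexp_j((g[j] - C[row*n + j]) * inv_eps).
// Hypothetical launch shape (8 warps per 256-thread block):
//   row_logsumexp<<<(n + 7) / 8, 256>>>(d_C, d_g, d_out, n, 1.0f / eps);
__global__ void row_logsumexp(const float* C, const float* g,
                              float* out, int n, float inv_eps) {
    int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= n) return;  // whole warp exits together

    // Pass 1: each lane scans a strided slice of the row for the max.
    float m = -FLT_MAX;
    for (int j = lane; j < n; j += 32)
        m = fmaxf(m, (g[j] - C[row * n + j]) * inv_eps);
    m = warp_max(m);  // broadcast to all 32 lanes via shuffles

    // Pass 2: accumulate exp(x - m); every exponent is <= 0, so no overflow.
    float s = 0.0f;
    for (int j = lane; j < n; j += 32)
        s += expf((g[j] - C[row * n + j]) * inv_eps - m);
    s = warp_sum(s);

    if (lane == 0) out[row] = m + logf(s);
}

The max-shift is what buys the stability: at epsilon = 10⁻⁴ the raw exponents (g_j - C_ij)/epsilon can reach magnitudes of 10⁴ or more, far outside FP32 range, but after subtracting the row maximum every argument passed to expf is at most zero.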

On benchmark dense OT problems of size 8192×8192, FastSinkhorn achieves a 12× speedup over the widely used POT library and a 5.9× speedup over GPU-accelerated PyTorch baselines, all while using just 256 MB of GPU memory. The implementation is validated on three tasks: image color transfer, 3D point cloud matching, and a convergence analysis. By providing a CUDA-native solver with no deep learning framework dependencies, FastSinkhorn offers a drop-in replacement for high-performance OT in production systems, data science pipelines, and research codebases that need both speed and numerical stability at scale.
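
As a back-of-the-envelope check (an inference, not a figure from the paper): a single dense 8192×8192 matrix in FP32 occupies 8192 × 8192 × 4 bytes = 268,435,456 bytes = 256 MiB, so the reported footprint is consistent with keeping roughly one full-precision cost matrix resident on the GPU.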

Key Points
  • 12× faster than the POT library and 5.9× faster than PyTorch GPU baselines on 8192×8192 dense OT.
  • Log-domain computation handles regularization epsilon down to 10⁻⁴ without numerical failure.
  • Only 256 MB GPU memory required for large-scale OT, enabling use on consumer GPUs.

Why It Matters

FastSinkhorn makes large-scale optimal transport practical on consumer GPUs, accelerating key ML workflows such as image color transfer and point cloud matching.