Research & Papers

DuaLip-GPU Technical Report

Researchers redesign LinkedIn's open-source solver for GPUs, achieving 10x speedups on extreme-scale matching workloads.

Deep Dive

A research team including Gregory Dexter and Aida Rahmattalabi has published a technical report detailing DuaLip-GPU, a major architectural redesign of LinkedIn's open-source DuaLip solver for large-scale linear programs (LPs). These LPs are critical for real-world decision systems involving ranking, allocation, and matching problems that must be solved repeatedly at massive scale. The original DuaLip implementation was built on a CPU-centric Scala/Spark stack and was tightly coupled to specific schemas, limiting both extensibility and its ability to leverage modern accelerators. The new work addresses these limitations head-on by fundamentally rethinking the solver's architecture.

The redesigned DuaLip-GPU introduces an operator-centric programming model in which LP formulations are expressed through composable primitives, allowing new problem types to be added locally while reusing a shared, optimized core. To exploit GPU parallelism, the team developed specialized execution techniques, including constraint-aligned sparse data layouts and batched projection kernels tailored to sparse matching constraints. The distributed design is streamlined to communicate only dual variables, keeping per-iteration communication small. Furthermore, the underlying ridge-regularized dual ascent algorithm itself is improved with Jacobi-style row normalization and a continuation scheme for the regularization parameter. The result is a system that delivers at least a 10x speedup on extreme-scale matching workloads compared to the prior CPU-based DuaLip, a significant leap in solver efficiency for large-scale operational problems.
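To make the algorithmic ingredients concrete, here is a minimal numpy sketch of ridge-regularized dual ascent with the two refinements the report names: Jacobi-style row normalization of the constraints and a continuation schedule that drives the ridge parameter toward zero. The function name, schedule, and step-size rule are illustrative assumptions, not the actual DuaLip-GPU implementation.

```python
import numpy as np

def dual_ascent_lp(A, b, c, gamma0=1.0, outer=6, inner=200):
    """Illustrative sketch (not DuaLip-GPU's code) of ridge-regularized
    dual ascent for:  min c^T x  s.t.  A x <= b,  0 <= x <= 1.
    The ridge term (gamma/2)||x||^2 smooths the dual problem; a halving
    continuation schedule then tightens gamma toward the true LP."""
    m, n = A.shape
    # Jacobi-style row normalization: rescale each constraint row to unit
    # norm so every dual coordinate sees a comparable gradient scale.
    norms = np.linalg.norm(A, axis=1)
    norms[norms == 0] = 1.0
    A = A / norms[:, None]
    b = b / norms
    lam = np.zeros(m)  # dual variables, warm-started across rounds
    for t in range(outer):
        gamma = gamma0 * 0.5 ** t      # continuation schedule on the ridge
        step = gamma / m               # conservative step for stability
        for _ in range(inner):
            # Closed-form primal minimizer of the regularized Lagrangian
            x = np.clip(-(c + A.T @ lam) / gamma, 0.0, 1.0)
            # Dual gradient = constraint violation; project onto lam >= 0
            lam = np.maximum(lam + step * (A @ x - b), 0.0)
    return x, lam
```

Because the dual update needs only `lam` (one value per constraint), a distributed version of this loop can exchange just the dual vector between workers, which is the communication pattern the paragraph above describes.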

Key Points
  • Achieves at least a 10x wall-clock speedup over the prior CPU-based DuaLip solver on extreme-scale matching workloads
  • Introduces an operator-centric model with composable primitives, decoupling problem specification from the optimization engine for better extensibility
  • Employs GPU-tailored techniques like constraint-aligned sparse layouts and batched projection kernels for sparse matching constraints
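The "batched projection" idea in the last bullet can be sketched as follows: in matching problems, each entity's decision variables must lie on a probability simplex, and projecting thousands of entities in one vectorized call (rather than looping) is exactly the shape of work a GPU kernel wants. This is a hedged illustration using the standard sorting-based simplex projection; the function name is hypothetical and the real kernels operate on constraint-aligned sparse layouts rather than a dense batch.

```python
import numpy as np

def batched_simplex_projection(V):
    """Illustrative batched projection: map each row of V onto the
    probability simplex {x : x >= 0, sum(x) = 1}, the per-entity
    feasible set arising in matching-type constraints. Uses the
    classic sorting-based algorithm, fully vectorized over rows."""
    n = V.shape[1]
    U = np.sort(V, axis=1)[:, ::-1]                  # descending sort per row
    css = np.cumsum(U, axis=1) - 1.0                 # cumulative sums minus budget
    idx = np.arange(1, n + 1)
    cond = U - css / idx > 0                         # candidate support sizes
    rho = cond.sum(axis=1)                           # largest valid support per row
    theta = css[np.arange(len(V)), rho - 1] / rho    # per-row soft threshold
    return np.maximum(V - theta[:, None], 0.0)       # clip below threshold
```

One call processes the whole batch with sorts, cumulative sums, and elementwise ops, all of which map directly onto GPU primitives; a per-entity Python loop would leave the accelerator idle.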

Why It Matters

Dramatically faster optimization enables more complex, real-time decision-making in large-scale systems like ad allocation, ride-sharing, and resource matching.