Training Non-Differentiable Networks via Optimal Transport
Achieves 93.4% accuracy on spiking networks, rivaling surrogate-gradient methods
Researchers have long struggled to train neural networks with non‑differentiable components—spiking neurons, quantized layers, discrete routing, or black‑box simulators—because backpropagation fails and surrogate gradients introduce bias. In a new paper, An T. Le unveils PolyStep, a gradient‑free optimizer that relies solely on forward passes. At each step, PolyStep evaluates the loss at structured polytope vertices in a compressed subspace, computes a softmax‑weighted assignment over the resulting cost matrix, and displaces particles toward low‑cost vertices via barycentric projection. This approach corresponds to the one‑sided limit of a regularized optimal transport problem, inheriting its geometric structure while avoiding costly Sinkhorn iterations.
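To make that loop concrete, here is a minimal forward-only sketch in NumPy of a PolyStep-style update. It is an illustration under assumptions, not the paper's reference implementation: the cross-polytope vertex construction, the particle count, the step size, and the temperature are choices made here, and a plain row-wise softmax stands in for the one-sided optimal-transport limit described above.

```python
import numpy as np

# Illustrative sketch of a PolyStep-like update, NOT the paper's reference
# implementation. The vertex construction and the hyperparameters
# (k, step_size, temperature) are assumptions made for this example.

def polystep_like_update(loss_fn, particles, rng, k=8, step_size=0.1, temperature=0.05):
    """One forward-only update of a swarm of candidate parameter vectors.

    loss_fn   : black-box loss, accessed only through forward evaluations
    particles : (n_particles, dim) current candidate solutions
    k         : dimension of the random compressed subspace
    """
    n, dim = particles.shape

    # 1. Random compressed subspace: k search directions in parameter space.
    basis = rng.standard_normal((k, dim)) / np.sqrt(dim)

    # 2. Structured polytope vertices in that subspace (here: a cross-polytope,
    #    i.e. +/- each basis direction, scaled by the step size).
    vertices = step_size * np.concatenate([basis, -basis], axis=0)       # (2k, dim)

    # 3. Cost matrix: loss of every particle displaced to every vertex,
    #    computed with forward passes only.
    cost = np.array([[loss_fn(x + v) for v in vertices] for x in particles])  # (n, 2k)

    # 4. Softmax-weighted assignment: low-cost vertices receive high weight.
    logits = -(cost - cost.min(axis=1, keepdims=True)) / temperature
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # 5. Barycentric projection: move each particle to the weighted average
    #    of its displaced copies.
    return particles + weights @ vertices


# Toy usage on a smooth quadratic, just to show the interface.
rng = np.random.default_rng(0)
loss = lambda x: float(np.sum(x ** 2))
xs = rng.standard_normal((16, 50))
for _ in range(200):
    xs = polystep_like_update(loss, xs, rng)
print(min(loss(x) for x in xs))  # should end up far below the initial losses
```

Note that the only access to the loss is through `loss_fn` evaluations, so each step costs a fixed number of forward passes (particles times vertices) regardless of whether the loss is differentiable.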
PolyStep delivers remarkable results across challenging architectures. On hard‑LIF spiking networks it hits 93.4% test accuracy—more than 60 percentage points above existing gradient‑free methods and within 4.4 points of the surrogate‑gradient Adam ceiling. It also leads all gradient‑free competitors on int8 quantization, argmax attention, staircase activations, and hard mixture‑of‑experts routing. Scaling to MAX‑SAT problems with up to 1 million variables, PolyStep maintains above 92% clause satisfaction, while evolution strategies drop 8–12 points. In reinforcement learning policy search, it matches OpenAI‑ES on classical control tasks and retains strong performance under integer and binary quantization that cripples gradient‑based approaches. The paper further proves convergence to conservative‑stationary points at rate O(log T/√T) on piecewise‑smooth losses, upgraded to Clarke‑stationary on the headline architectures, matching known zeroth‑order query‑complexity lower bounds.
- PolyStep uses only forward passes, eliminating the need for gradients or surrogate‑gradient approximations (illustrated in the sketch after this list).
- On hard‑LIF spiking networks, it achieves 93.4% test accuracy—outperforming gradient‑free baselines by >60 percentage points.
- Proven convergence to stationary points at rate O(log T/√T), matching theoretical lower bounds for forward‑only methods.
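For readers unfamiliar with why these architectures defeat backpropagation, the short sketch below shows a hard-threshold spiking layer with int8-quantized weights evaluated purely by a forward pass. The network shape, threshold, and quantization scale are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

# Both forward maps below are piecewise constant, so their derivative is zero
# almost everywhere and undefined at the jumps: backprop returns no useful
# signal, and surrogate gradients must approximate it. A forward-only method
# simply evaluates the loss and never needs that derivative.

def heaviside_spike(membrane_potential, threshold=1.0):
    # Hard spiking nonlinearity: emits 1 where the potential crosses threshold.
    return (membrane_potential >= threshold).astype(np.float32)

def quantize_int8(w, scale=0.05):
    # Uniform int8 quantization: round to the nearest level, then clip.
    return np.clip(np.round(w / scale), -128, 127) * scale

def forward_loss(w, x, y):
    # One hard-thresholded layer with quantized weights, evaluated purely by a
    # forward pass; this is the only access a gradient-free optimizer needs.
    spikes = heaviside_spike(x @ quantize_int8(w))
    return float(np.mean((spikes - y) ** 2))

rng = np.random.default_rng(0)
w = rng.standard_normal((20, 5))
x = rng.standard_normal((64, 20))
y = (rng.random((64, 5)) > 0.5).astype(np.float32)
print(forward_loss(w, x, y))  # finite loss value, no gradient required
```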
Why It Matters
Enables efficient training of models for next‑gen AI hardware (spiking and quantized accelerators) without backprop, unlocking faster, more energy‑efficient models.