SMAVE algorithm slashes dimension reduction runtime by orders of magnitude
New Riemannian optimization method for sufficient dimension reduction in high-dimensional regression reports runtimes orders of magnitude lower than RMAVE.
Thibault Pautrel and François Portier introduce SMAVE (Stochastic Manifold Approximation and Variance Estimation), an algorithm for sufficient dimension reduction (SDR) designed to handle high-dimensional regression efficiently. SDR projects covariates onto a low-dimensional subspace that preserves the conditional mean of the response. Existing gradient-based methods fall into two camps, each with a cost: ambient-space approaches suffer from the curse of dimensionality, while localized-projection approaches incur a per-iteration cost that is quadratic in the sample size. SMAVE addresses both by recasting the empirical Minimum Average Variance Estimation (MAVE) criterion as a smooth maximization problem on the Stiefel manifold, the set of matrices with orthonormal columns, so that each point on the manifold is an orthonormal basis for a candidate subspace. The algorithm uses a closed-form Riemannian gradient and sparse nearest-neighbor localization in the projected low-dimensional space, which sharply reduces computational overhead. The authors prove almost-sure convergence for a simplified version of the algorithm and derive a non-asymptotic rate matching that of standard stochastic first-order methods.
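To make the idea of optimizing over the Stiefel manifold concrete, here is a minimal sketch of Riemannian gradient ascent with a QR retraction. It is not the authors' implementation: the objective is a toy stand-in, trace(BᵀAB), rather than the MAVE criterion, and the function names, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

def riemannian_grad(B, G):
    """Project a Euclidean gradient G onto the tangent space of the
    Stiefel manifold at B (embedded metric): G - B * sym(B^T G)."""
    BtG = B.T @ G
    return G - B @ (0.5 * (BtG + BtG.T))

def qr_retract(M):
    """Map a p x d matrix back onto the manifold via the Q factor of a QR
    decomposition; the sign fix makes the result unique."""
    Q, R = np.linalg.qr(M)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

def stiefel_ascent(A, d, steps=500, lr=0.05, seed=0):
    """Maximize the toy objective trace(B^T A B) over orthonormal p x d B.
    (Assumption: A is symmetric; this is NOT the MAVE criterion.)"""
    rng = np.random.default_rng(seed)
    B = qr_retract(rng.standard_normal((A.shape[0], d)))
    for _ in range(steps):
        G = 2.0 * A @ B                       # Euclidean gradient of the toy objective
        B = qr_retract(B + lr * riemannian_grad(B, G))
    return B

# Toy problem: the maximizer spans the top-d eigenvectors of A.
p, d = 8, 2
A = np.diag(np.arange(p, 0, -1).astype(float))  # eigenvalues 8, 7, ..., 1
B = stiefel_ascent(A, d)
objective = np.trace(B.T @ A @ B)               # approaches 8 + 7 = 15
```

Every iterate stays exactly orthonormal, which is the point of working on the manifold: no penalty term or post-hoc orthogonalization is needed. A stochastic variant would replace `G` with a gradient estimated from a random subset of the data.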
Empirically, SMAVE matches or improves on the state-of-the-art RMAVE algorithm for synthetic subspace recovery across moderate-to-high ambient dimensions, with runtimes that are orders of magnitude lower (for example, seconds rather than hours on real datasets). On four benchmark datasets, SMAVE uniformly outperforms the popular Outer Product of Gradients (OPG) method and matches or exceeds RMAVE's accuracy. This makes SDR practical for large-scale, high-dimensional machine learning tasks where previous methods were computationally prohibitive. The code and data are available via the arXiv submission (arXiv:2606.00413).
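Subspace recovery is usually scored by comparing the estimated subspace to the true one in a way that does not depend on the choice of basis. The summary above does not say which metric the paper reports, so the following is only one common choice, the Frobenius distance between orthogonal projectors; the function name is an illustrative assumption.

```python
import numpy as np

def projection_distance(B_true, B_hat):
    """Frobenius distance between the orthogonal projectors onto the column
    spaces of B_true and B_hat. It is zero iff the two subspaces coincide and
    is invariant to the basis chosen within each subspace."""
    Q1, _ = np.linalg.qr(B_true)   # orthonormal basis of span(B_true)
    Q2, _ = np.linalg.qr(B_hat)    # orthonormal basis of span(B_hat)
    return np.linalg.norm(Q1 @ Q1.T - Q2 @ Q2.T, ord="fro")

# Same subspace, different basis: distance is (numerically) zero.
B_true = np.eye(5)[:, :2]
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
same_span = projection_distance(B_true, B_true @ rotation)

# Mutually orthogonal d-dimensional subspaces: distance is sqrt(2 * d).
orthogonal_span = projection_distance(B_true, np.eye(5)[:, 2:4])
```

Comparing projectors rather than the basis matrices themselves matters because any SDR method can only identify the subspace, not a particular basis for it.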
- Combines Riemannian optimization on the Stiefel manifold with sparse nearest-neighbor localization for efficient dimension reduction.
- Matches RMAVE's subspace recovery accuracy while running orders of magnitude faster (seconds vs. hours on real datasets).
- Convergence guarantees: almost-sure convergence for a simplified version of the algorithm, plus a non-asymptotic rate matching standard stochastic first-order methods.
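The localization idea in the first takeaway can be illustrated with a small sketch: project the covariates with a candidate basis B, find each point's nearest neighbors in the projected space, and measure how well local linear fits explain the response there. This is an assumption-laden illustration of a MAVE-style criterion, not the paper's estimator; the function name, the uniform neighbor weights, and the brute-force neighbor search are all choices made here for clarity.

```python
import numpy as np

def knn_local_linear_loss(X, y, B, k=10):
    """Average residual variance of local linear fits of y on the projected
    covariates X @ B, each fit using only the k nearest neighbors in the
    projected space. Lower values mean B preserves more of the regression
    signal. (Illustrative; uniform weights, brute-force neighbor search.)"""
    Z = X @ B                                          # n x d projected covariates
    sq = np.sum(Z ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T     # pairwise squared distances
    nbrs = np.argsort(D2, axis=1)[:, :k]               # k nearest, including the point itself
    total = 0.0
    for j in range(X.shape[0]):
        idx = nbrs[j]
        # Local linear model around Z[j]: intercept plus centered coordinates.
        design = np.column_stack([np.ones(k), Z[idx] - Z[j]])
        coef, *_ = np.linalg.lstsq(design, y[idx], rcond=None)
        resid = y[idx] - design @ coef
        total += np.mean(resid ** 2)
    return total / X.shape[0]

# Toy data: the response depends on the covariates only through the first coordinate.
rng = np.random.default_rng(0)
n, p = 300, 5
X = rng.standard_normal((n, p))
y = np.sin(X[:, 0]) + 0.01 * rng.standard_normal(n)

loss_true = knn_local_linear_loss(X, y, np.eye(p)[:, [0]])    # informative direction
loss_wrong = knn_local_linear_loss(X, y, np.eye(p)[:, [1]])   # uninformative direction
```

Because neighbors are searched in d dimensions rather than in the ambient p dimensions, the neighborhoods stay meaningful as p grows. The full distance matrix built here costs quadratic time and memory in n; a scalable version would use a spatial index or approximate neighbor search so that only k neighbors per point are ever stored.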
Why It Matters
Enables scalable sufficient dimension reduction for high-dimensional data, with reported runtimes orders of magnitude lower than RMAVE (seconds rather than hours on real datasets) at comparable or better accuracy.