High-dimensional Semi-supervised Classification via the Fermat Distance
A density-sensitive metric whose estimation error shrinks exponentially in the pooled sample size, making unlabeled data far more valuable than before...
In a new arXiv paper, researchers Ruoxu Tan and Yiming Zang introduce a novel approach to semi-supervised classification for high-dimensional data using the Fermat distance, a density-sensitive metric that naturally encodes cluster structures. The authors propose two classifiers: a weighted k-nearest neighbors (k-NN) classifier and a multidimensional scaling (MDS)-induced classifier. By applying MDS with a large target dimension, they enable linear classifiers to effectively handle complex manifold data, a significant advancement for high-dimensional settings where labeled data is scarce.
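To make the idea concrete, here is a minimal sketch of a sample Fermat distance: shortest paths over a graph on the pooled data whose edge weights are Euclidean distances raised to a power p > 1, which makes paths through dense regions cheaper. The complete-graph construction and the choice of p here are illustrative assumptions; the paper's exact estimator may differ.

```python
import heapq
import math

def fermat_distances(points, p=3.0):
    """All-pairs sample Fermat distance (illustrative sketch).

    Runs Dijkstra on the complete graph over `points`, where the edge
    (i, j) costs ||x_i - x_j||^p. For p > 1, long direct jumps are
    penalized relative to chains of short hops through dense regions,
    which is what makes the metric density-sensitive.
    """
    n = len(points)
    dist = [[math.inf] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0.0
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[src][u]:
                continue  # stale queue entry
            for v in range(n):
                if v == u:
                    continue
                nd = d + math.dist(points[u], points[v]) ** p
                if nd < dist[src][v]:
                    dist[src][v] = nd
                    heapq.heappush(heap, (nd, v))
    return dist
```

For three collinear points at unit spacing and p = 2, the direct edge between the endpoints costs 4, but the two-hop path through the middle point costs only 2, so the Fermat distance follows the chain of points rather than jumping across.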
The theoretical contributions are substantial. The team derives a sharp lower bound for the expected excess risk within clusters and proves that the weighted k-NN classifier using the true Fermat distance is minimax optimal. Crucially, they quantify the utility of unlabeled data by showing that the error from estimating the Fermat distance decays exponentially with the pooled sample size, a rate far faster than comparable rates in the literature. Extensive experiments on synthetic and real datasets demonstrate competitive or superior performance against state-of-the-art graph-based semi-supervised classifiers, highlighting the practical viability of the approach.
- Weighted k-NN classifier using Fermat distance achieves minimax optimality for high-dimensional semi-supervised learning.
- Error from estimating the Fermat distance decays exponentially with pooled sample size, outperforming prior rates.
- MDS with large target dimension enables linear classifiers to handle complex manifold data effectively.
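The first bullet can be sketched in a few lines: classify a query from its distances to the labeled points, giving nearer neighbors more say. The inverse-distance weighting used below is an illustrative choice, not the paper's derived optimal weights, and the distances would in practice be estimated Fermat distances as described above.

```python
def weighted_knn_predict(dist_to_labeled, labels, k=3):
    """Weighted k-NN vote over precomputed distances (illustrative sketch).

    `dist_to_labeled[i]` is the (e.g. Fermat) distance from the query to
    labeled point i with class `labels[i]`. The k nearest points vote
    with weight 1/distance; the class with the largest total wins.
    """
    nearest = sorted(range(len(labels)), key=lambda i: dist_to_labeled[i])[:k]
    votes = {}
    for i in nearest:
        w = 1.0 / (dist_to_labeled[i] + 1e-12)  # guard against zero distance
        votes[labels[i]] = votes.get(labels[i], 0.0) + w
    return max(votes, key=votes.get)
```

With distances [0.1, 0.2, 5.0, 6.0] and labels ['a', 'a', 'b', 'b'], the three nearest neighbors are two 'a's and one 'b', and the inverse-distance weights make 'a' win decisively.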
Why It Matters
These results point to efficient learning with limited labeled data, which matters for real-world applications such as medical imaging and fraud detection, where labels are costly to obtain.