AdaGraph: Graph-Native Clustering Beats Curse of Dimensionality

Replaces distance metrics with graph topology to cluster data in up to 5,000 dimensions.

Deep Dive

AdaGraph, introduced by Ahmed Elmahdi as part of a new Structure-Centric Machine Learning (SC-ML) paradigm, tackles one of machine learning's oldest problems: the curse of dimensionality. Many traditional clustering algorithms rely on Euclidean distances, which become less informative as dimensionality grows because pairwise distances concentrate around similar values. That makes them unreliable for high-dimensional genomics, materials science, or text data. AdaGraph instead works strictly within the topology of a k-nearest neighbors (kNN) graph, which is reported to preserve meaningful relational structure regardless of dimensionality. It requires no a priori specification of the number of clusters, handles noise natively, and scales via the SLCD (Sample-Learn-Calibrate-Deploy) framework. For unsupervised tuning, it pairs with Graph-SCOPE, a topology-based cluster validity index that reaches a Kendall tau of at least 0.92 against ground-truth clustering quality across all tested dimensions, far ahead of Silhouette (tau≈0.46), Davies-Bouldin, and Calinski-Harabasz.
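To make the kNN-graph idea concrete, here is a minimal Python sketch. It is not AdaGraph's algorithm, which this summary does not specify; it only illustrates the general pattern of building a kNN graph and then clustering from graph connectivity alone. The helper names (make_blobs, knn_lists, knn_graph_components), the synthetic two-blob data, and the choice k=15 are all assumptions made for illustration.

```python
import math
import random


def make_blobs(n_per_cluster, dim, centers, spread, rng):
    """Draw Gaussian points around each center (same center value on every axis)."""
    points, labels = [], []
    for label, center in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append([rng.gauss(center, spread) for _ in range(dim)])
            labels.append(label)
    return points, labels


def knn_lists(points, k):
    """For each point, return the set of indices of its k nearest neighbours.

    Distances are used only to rank neighbours; nothing downstream sees them.
    """
    neighbours = []
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append({j for _, j in ranked[:k]})
    return neighbours


def knn_graph_components(neighbours):
    """Label connected components of the symmetrised kNN graph.

    Toy stand-in for graph-based clustering: i and j are linked if either
    one lists the other as a neighbour.
    """
    n = len(neighbours)
    adjacency = [set(nbrs) for nbrs in neighbours]
    for i, nbrs in enumerate(neighbours):
        for j in nbrs:
            adjacency[j].add(i)

    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first traversal of one component
            i = stack.pop()
            for j in adjacency[i]:
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical demo data: two well-separated blobs in 200 dimensions.
rng = random.Random(0)
points, truth = make_blobs(n_per_cluster=30, dim=200, centers=[0.0, 5.0],
                           spread=1.0, rng=rng)
pred = knn_graph_components(knn_lists(points, k=15))
print("components found:", len(set(pred)))
print("matches the generating blobs:", pred == truth)
```

Once the neighbour lists exist, the clustering step reads only edges, never distances. The value k=15 is deliberately generous for this toy: with 30 points per blob, every component must contain at least 16 nodes, so each blob cannot split. A practical method would need a principled way to choose k and to handle noise, which is the role the article assigns to Graph-SCOPE and AdaGraph itself.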

In benchmarks on 10 synthetic datasets spanning 10 to 5,000 dimensions, AdaGraph tuned with Graph-SCOPE achieved a mean adjusted Rand index (ARI) of 0.900 and correctly selected the number of clusters on 9 out of 10 datasets. AdaGraph was also evaluated on real data from three scientific domains. In hepatocellular carcinoma gene co-expression analysis (GSE14520: 10,000 genes, 488 patients) with no dimensionality reduction, it identified condition-specific gene modules that WGCNA, ICA, NMF, and Spectral Biclustering failed to resolve. In natural language processing on the 20NG-6cat dataset, AdaGraph achieved ARI=0.751 versus HDBSCAN's 0.464, a 62% relative improvement. In materials science, it achieved the highest Graph-SCOPE score when clustering superconductors (145-dimensional Magpie features), perovskites, and JARVIS-DFT materials, which suggests the approach carries across domains.
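For readers unfamiliar with the metric, the sketch below shows how an adjusted Rand index is computed from two label lists using the standard pair-counting formula, and how the 62% figure follows from the two reported ARI values. The function names and the tiny label lists are illustrative assumptions; only the values 0.751 and 0.464 come from the article.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index via pair counting (1.0 means identical partitions)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs that share a cluster in both partitions, in truth only, in prediction only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. both partitions are trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def relative_improvement(new, baseline):
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline


# Label names do not matter, only which points are grouped together.
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_grouping, crossed_grouping)

# ARI values reported in the article for 20NG-6cat.
gain = relative_improvement(0.751, 0.464)
print(f"relative improvement: {gain:.1%}")
```

Because ARI is corrected for chance, random labelings score near 0 and worse-than-chance groupings can go negative, which is why it is a stricter yardstick than raw pairwise agreement.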

Key Points
  • Operates entirely on kNN graph topology instead of Euclidean distances, targeting the curse of dimensionality in benchmarks of up to 5,000 dimensions.
  • Improves ARI over HDBSCAN by 62% in relative terms on text clustering (0.751 vs 0.464) and correctly identifies the cluster count on 9 of 10 synthetic benchmarks.
  • Validated on real-world gene expression, NLP, and materials science datasets, discovering patterns missed by WGCNA, ICA, NMF, and Spectral Biclustering.

Why It Matters

Enables accurate high-dimensional clustering without dimensionality reduction for genomics, materials discovery, and NLP.