Federated Hierarchical Clustering with Automatic Selection of Optimal Cluster Numbers
New framework automatically finds the optimal number of clusters in privacy-protected, distributed data, tackling a core limitation of federated learning.
A team of researchers has introduced Fed-k*-HC, a groundbreaking federated clustering (FC) framework designed to overcome a fundamental flaw in existing methods. Current FC approaches assume a known, uniform number of clusters across client data, a condition rarely met in practice, where cluster counts are unknown and cluster sizes are naturally imbalanced. Fed-k*-HC automates discovery of the optimal cluster number (k*) by having each client locally generate compact 'micro-subclusters' and send only their prototypes to a central server. This preserves privacy while still conveying the information about the data distribution that the server needs.
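The client-side step can be sketched as follows. This is an illustrative approximation only: the function name, the k-means-style over-segmentation, and the (centroid, size, radius) prototype summary are assumptions for demonstration, not the paper's exact protocol.

```python
import numpy as np

def client_micro_subclusters(X, n_micro=10, n_iter=20, seed=0):
    """Over-segment a client's local data into many small 'micro-subclusters'
    and return only their prototypes; raw points never leave the client.

    Sketch: plain k-means with n_micro centers (deliberately larger than the
    expected number of true clusters), summarizing each micro-subcluster as
    (centroid, point count, radius). The summary format is an assumption.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_micro, replace=False)].copy()
    for _ in range(n_iter):
        # assign each point to its nearest center
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        for j in range(n_micro):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    prototypes = []
    for j in range(n_micro):
        pts = X[labels == j]
        if len(pts) == 0:
            continue  # drop empty micro-subclusters
        radius = float(np.linalg.norm(pts - centers[j], axis=1).max())
        prototypes.append((centers[j], len(pts), radius))
    return prototypes
```

Only the compact prototype tuples are transmitted, so communication cost stays small and individual records remain on the client.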
On the server, a novel density-based hierarchical merging algorithm takes over. It progressively combines these micro-subclusters based on their proximity and density relationships. This process is self-terminating; it continues merging until the natural separation in the data is found, thereby automatically revealing the true k*. This design is robust against the information loss inherent in privacy-preserving data transmission and can handle clusters of diverse shapes and sizes, moving beyond simplistic spherical cluster assumptions.
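A server-side merge with a self-terminating stopping rule might look like the following minimal sketch. The size-weighted centroid updates and the distance-gap heuristic for deciding when the "natural separation" has been reached are assumptions standing in for the paper's density-based criterion.

```python
import numpy as np

def server_merge(prototypes, gap_factor=2.0):
    """Agglomeratively merge micro-subcluster prototypes received from clients.

    Sketch: repeatedly merge the two closest centroids (size-weighted), and
    self-terminate when the next merge distance jumps by more than
    `gap_factor` times the previous one -- a simple proxy for detecting the
    natural separation in the data. Both the rule and gap_factor are
    illustrative assumptions, not the paper's exact algorithm.
    """
    clusters = [(np.asarray(c, float), int(n)) for c, n, _ in prototypes]
    last_dist = None
    while len(clusters) > 1:
        # find the closest pair of cluster centroids
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(clusters[i][0] - clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # self-termination: a sharp jump in merge distance signals that the
        # remaining clusters are the natural, well-separated ones
        if last_dist is not None and d > gap_factor * last_dist:
            break
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        merged = ((ni * ci + nj * cj) / (ni + nj), ni + nj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        last_dist = d
    return clusters  # len(clusters) is the discovered k*
```

Because merging stops on its own, the server never needs k as an input; the count of surviving clusters is the estimate of k*.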
Extensive testing on diverse datasets has demonstrated that Fed-k*-HC recovers the correct number of clusters more accurately than previous federated methods. By solving the 'unknown k' problem, the framework unlocks more practical and accurate unsupervised learning across sectors like healthcare, finance, and IoT, where data is distributed and sensitive. It represents a significant step toward making federated learning viable for complex, real-world exploratory data analysis without compromising data privacy or requiring unrealistic prior knowledge.
- Automates optimal cluster count: The Fed-k*-HC framework's core innovation is automatically determining the correct number of clusters (k*) from the data itself, eliminating a critical manual guesswork step in federated learning.
- Privacy-preserving hierarchical design: Clients only share prototypes of locally generated 'micro-subclusters'. A server-side density-based merging process then reconstructs the global data distribution hierarchy to find k* without accessing raw data.
- Handles real-world data imbalance: The density-based merging approach is specifically designed to identify clusters of varying sizes and shapes, addressing a key shortfall of methods that assume uniformly sized, spherical clusters.
Why It Matters
Enables accurate data pattern discovery across siloed, sensitive datasets (e.g., hospitals, banks) without sharing raw data, moving federated learning from theory to practice.