Approximate Distributed K-Means Clustering over a Peer-to-Peer Network

Authors:
Souptik Datta; Chris Giannella; Hillol Kargupta
Affiliations:
University of Maryland, Baltimore; Loyola College, Baltimore; University of Maryland, Baltimore
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2009

Abstract

Data-intensive Peer-to-Peer (P2P) networks are finding an increasing number of applications. Data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in such environments, since they typically require centralizing the distributed data, which is usually impractical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternative. This paper considers the distributed K-means clustering problem, where the data and computing resources are distributed over a large P2P network. It offers two algorithms that produce an approximation of the result produced by the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network and produces clusterings using “local” synchronization only. The second algorithm uses uniformly sampled peers and provides analytical guarantees regarding the accuracy of clustering on a P2P network. Empirical results show that both algorithms perform well compared to their centralized counterparts, at a modest communication cost.
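
The idea of approximating centralized K-means from uniformly sampled peers can be illustrated with a small sketch. This is not the authors' algorithm and carries none of the paper's analytical guarantees. It only shows the general pattern: each sampled peer reports per-cluster sums and counts for its local data, and the centroids are updated from those aggregates instead of from the full data set. The function names (`nearest`, `local_stats`, `sampled_kmeans`), the representation of a peer as a list of points, and the assumption that every peer can be reached and shares the same initial centroids are all hypothetical choices made for illustration.

```python
import random


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def local_stats(points, centroids):
    """Per-cluster coordinate sums and point counts for one peer's local data."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        j = nearest(p, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def sampled_kmeans(peers, centroids, sample_size, iterations, seed=0):
    """Approximate K-means using a fresh uniform sample of peers per iteration.

    peers: list of peers, each a list of points (hypothetical representation).
    centroids: initial centroids, assumed known to every peer.
    """
    rng = random.Random(seed)
    centroids = [list(c) for c in centroids]
    k, dim = len(centroids), len(centroids[0])
    for _ in range(iterations):
        sample = rng.sample(peers, sample_size)  # uniform, without replacement
        total_sums = [[0.0] * dim for _ in range(k)]
        total_counts = [0] * k
        for points in sample:
            sums, counts = local_stats(points, centroids)
            for j in range(k):
                total_counts[j] += counts[j]
                for d in range(dim):
                    total_sums[j][d] += sums[j][d]
        for j in range(k):
            # Keep the previous centroid if no sampled point fell in cluster j.
            if total_counts[j] > 0:
                centroids[j] = [s / total_counts[j] for s in total_sums[j]]
    return centroids
```

When `sample_size` equals the number of peers, each iteration of this sketch coincides with one step of standard centralized K-means on the union of the data. Smaller samples reduce communication at the price of an approximate centroid update.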