The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited to distributed memory systems with reliable interconnection networks, such as massively parallel processors and clusters of workstations. However, in large-scale geographically distributed systems, the straightforward parallel algorithm can be rendered useless by a single communication failure or by high latency in communication paths. The lack of scalable and fault-tolerant global communication and synchronisation methods in large-scale systems has hindered the adoption of the K-Means algorithm for applications in large networked systems such as wireless sensor networks, peer-to-peer systems and mobile ad hoc networks. This work proposes a fully distributed K-Means algorithm (Epidemic K-Means) which does not require global communication and is intrinsically fault tolerant. The proposed algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis against state-of-the-art sampling methods shows that the proposed method overcomes the limitations of sampling-based approaches for skewed cluster distributions. The experimental analysis confirms that the proposed algorithm is very accurate and fault tolerant under unreliable network conditions (message loss and node failures) and is suitable for asynchronous networks of very large and extreme scale.
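To make the idea of K-Means without global communication concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes gossip-style pairwise averaging as the decentralised aggregation step, one-dimensional data, identical initial centroids on every node, and a single-process simulation with no message loss or node failure. All function names and parameters (`local_stats`, `gossip_average`, `gossip_kmeans`, `rounds`) are hypothetical and chosen only for illustration.

```python
import random


def local_stats(points, centroids):
    """Per-cluster [sum, count] of one node's points under the given centroids."""
    stats = [[0.0, 0.0] for _ in centroids]
    for x in points:
        # Assign the point to its nearest centroid (1-D distance).
        j = min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))
        stats[j][0] += x
        stats[j][1] += 1.0
    return stats


def gossip_average(states, rounds, rng):
    """Pairwise averaging: in each round, every node averages its state with
    one random peer. Each exchange preserves the network-wide totals, so all
    states drift towards the global average without any global communication."""
    n = len(states)
    for _ in range(rounds):
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # ensure the peer differs from node i
            for k in range(len(states[i])):
                for t in range(2):
                    mean = (states[i][k][t] + states[j][k][t]) / 2.0
                    states[i][k][t] = mean
                    states[j][k][t] = mean
    return states


def gossip_kmeans(node_data, init_centroids, iterations=10, rounds=30, seed=0):
    """Each node keeps its own centroid estimates and updates them from
    gossip-averaged per-cluster sums and counts (assumed setup, see lead-in)."""
    rng = random.Random(seed)
    centroids = [list(init_centroids) for _ in node_data]
    for _ in range(iterations):
        states = [local_stats(pts, c) for pts, c in zip(node_data, centroids)]
        gossip_average(states, rounds, rng)
        for i, state in enumerate(states):
            # averaged sum / averaged count approximates the global centroid;
            # an empty cluster keeps its previous centroid.
            centroids[i] = [
                s / c if c > 1e-12 else centroids[i][k]
                for k, (s, c) in enumerate(state)
            ]
    return centroids


# Skewed distribution: six nodes hold only low values, two hold only high values.
demo_nodes = [[0.0, 1.0, 2.0]] * 6 + [[10.0, 11.0, 12.0]] * 2
demo_centroids = gossip_kmeans(demo_nodes, [1.0, 9.0], iterations=3)
print(demo_centroids[0], demo_centroids[7])
```

In this sketch every node ends up with centroid estimates close to those a centralised K-Means would compute over the union of the data, even though no node ever sees more than its own points and one peer's running averages. Increasing `rounds` tightens the approximation, which is one way to read the abstract's "as closely as desired"; how the paper itself handles asynchrony, message loss and node failures is not covered here.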