Fault tolerant decentralised K-Means clustering for asynchronous large-scale networks

Authors:
Giuseppe Di Fatta; Francesco Blasa; Simone Cafiero; Giancarlo Fortino
Affiliations:
School of Systems Engineering, The University of Reading, Reading, UK; Dipartimento di Informatica, Elettronica e Sistemistica, University of Calabria, Italy; Dipartimento di Informatica, Elettronica e Sistemistica, University of Calabria, Italy; Dipartimento di Informatica, Elettronica e Sistemistica, University of Calabria, Italy
Venue:
Journal of Parallel and Distributed Computing
Year:
2013

Abstract

The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks, such as massively parallel processors and clusters of workstations. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or by high latency in communication paths. The lack of scalable and fault-tolerant global communication and synchronisation methods in large-scale systems has hindered the adoption of the K-Means algorithm for applications in large networked systems such as wireless sensor networks, peer-to-peer systems and mobile ad hoc networks. This work proposes a fully distributed K-Means algorithm (Epidemic K-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against state-of-the-art sampling methods and shows that the proposed method overcomes the limitations of sampling-based approaches for skewed cluster distributions. The experimental analysis confirms that the proposed algorithm is very accurate and fault tolerant under unreliable network conditions (message loss and node failures) and is suitable for asynchronous networks of very large and extreme scale.
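
The abstract does not spell out the protocol, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that each node computes per-cluster coordinate sums and counts over its local points, and that nodes then estimate the global aggregates by repeated pairwise averaging with random peers (a standard gossip averaging scheme, swapped in here for illustration). The function names `local_stats`, `gossip_average` and `gossip_kmeans_step` are hypothetical, and the simulation is synchronous and failure-free, unlike the networks the paper targets.

```python
import random


def local_stats(points, centroids):
    """Per-cluster coordinate sums and counts for one node's local points."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0.0] * k
    for p in points:
        # Assign the point to its nearest centroid (squared Euclidean distance).
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        counts[j] += 1.0
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def gossip_average(vectors, rounds, rng):
    """Assumed gossip scheme: each node averages its vector with a random peer."""
    state = [list(v) for v in vectors]
    n = len(state)  # requires at least two nodes
    for _ in range(rounds):
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # pick a peer different from i
            avg = [(a + b) / 2.0 for a, b in zip(state[i], state[j])]
            state[i] = avg
            state[j] = list(avg)
    return state


def gossip_kmeans_step(node_points, centroids, rounds, rng):
    """One K-Means iteration; returns each node's own estimate of the centroids."""
    k, dim = len(centroids), len(centroids[0])
    flat = []
    for pts in node_points:
        sums, counts = local_stats(pts, centroids)
        flat.append([x for s in sums for x in s] + counts)
    estimates = gossip_average(flat, rounds, rng)
    per_node = []
    for est in estimates:
        cents = []
        for j in range(k):
            cnt = est[k * dim + j]
            if cnt > 1e-12:
                # Averaged sum / averaged count equals global sum / global count.
                cents.append([est[j * dim + d] / cnt for d in range(dim)])
            else:
                cents.append(list(centroids[j]))  # empty cluster: keep old centroid
        per_node.append(cents)
    return per_node


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = [[[0.0], [1.0]], [[2.0]], [[10.0], [11.0]], [[12.0]]]
    for cents in gossip_kmeans_step(nodes, [[1.0], [9.0]], 60, rng):
        print(cents)
```

Because pairwise averaging preserves the network-wide total of every component, each node's ratio of averaged sum to averaged count approaches the centroid a centralised K-Means step would compute over the union of all local data. More gossip rounds tighten the approximation, which matches the abstract's claim that the centralised solution can be approximated as closely as desired.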