DS-means: distributed data stream clustering

Authors:
Alessio Guerrieri;Alberto Montresor
Affiliations:
University of Trento, Italy;University of Trento, Italy
Venue:
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Year:
2012

Abstract

This paper proposes DS-means, a novel algorithm for clustering distributed data streams. Given a network of computing nodes, each receiving its share of a distributed data stream, our goal is to obtain a common clustering under two restrictions: (i) the number of clusters is not known in advance, and (ii) nodes may not share individual points of their datasets, only aggregate information. A motivating example for DS-means is the decentralized detection of botnets, where a collection of independent ISPs may want to detect common threats but are unwilling to share their users' valuable data. In DS-means, nodes execute a distributed version of K-means on each chunk of data they receive, producing a compact representation of the data of the entire network. X-means is then executed on this representation to estimate the number of clusters. Experiments on both synthetic and real-life datasets show that the algorithm is precise, efficient and robust.
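
To illustrate the restriction that nodes exchange only aggregate information, the following is a minimal Python sketch of a single K-means update in which each node reports per-cluster coordinate sums and point counts rather than raw points. It is not the DS-means algorithm itself. The function names (`nearest`, `local_aggregates`, `merge_and_update`) are hypothetical, and the merge is done in one place for simplicity, whereas the abstract does not specify how nodes combine their aggregates.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
# Per-cluster aggregate: (coordinate-wise sum of assigned points, point count).
Aggregate = Tuple[List[float], int]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Return the index of the centroid closest to `point` (squared Euclidean)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def local_aggregates(points: Sequence[Point],
                     centroids: Sequence[Point]) -> List[Aggregate]:
    """Computed privately on one node: only sums and counts leave the node."""
    dim = len(centroids[0])
    aggs: List[Aggregate] = [([0.0] * dim, 0) for _ in centroids]
    for pt in points:
        k = nearest(pt, centroids)
        sums, count = aggs[k]
        aggs[k] = ([s + x for s, x in zip(sums, pt)], count + 1)
    return aggs


def merge_and_update(all_aggs: Sequence[List[Aggregate]],
                     centroids: Sequence[Point]) -> List[Point]:
    """Combine every node's aggregates into new centroids.

    Assumption: the merge happens centrally here; a decentralized system
    would need some other way to combine the same sums and counts.
    """
    new_centroids: List[Point] = []
    for k, old in enumerate(centroids):
        total = sum(node[k][1] for node in all_aggs)
        if total == 0:
            # No node assigned a point to this cluster: keep the old centroid.
            new_centroids.append(old)
            continue
        sums = [sum(node[k][0][d] for node in all_aggs) for d in range(len(old))]
        new_centroids.append(tuple(s / total for s in sums))
    return new_centroids
```

Because a centroid is just the sum of its assigned points divided by their count, merging the per-node sums and counts gives the same update as running the K-means step on the pooled data, without any node revealing an individual point.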