DS-means: distributed data stream clustering

  • Authors:
  • Alessio Guerrieri;Alberto Montresor

  • Affiliations:
  • University of Trento, Italy;University of Trento, Italy

  • Venue:
  • Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper proposes DS-means, a novel algorithm for clustering distributed data streams. Given a network of computing nodes, each of them receiving its share of a distributed data stream, our goal is to obtain a common clustering under the following restrictions (i) the number of clusters is not known in advance and (ii) nodes are not allowed to share single points of their datasets, but only aggregate information. A motivating example for DS-means is the decentralized detection of botnets, where a collection of independent ISPs may want to detect common threats, but are unwilling to share their precious users' data. In DS-means, nodes execute a distributed version of K-means on each chunk of data they receive to provide a compact representation of the data of the entire network. Later, X-means is executed on this representation to obtain an estimate of the number of clusters. A number of experiments on both synthetic and real-life datasets show that our algorithm is precise, efficient and robust.