Clustering distributed sensor data streams using local processing and reduced communication

  • Authors:
  • João Gama;Pedro Pereira Rodrigues;Luís Lopes

  • Affiliations:
  • (Correspd. E-mail: jgama@fep.up.pt) LIAAD, University of Porto, Porto, Portugal and Faculty of Economics, University of Porto, Porto, Portugal;LIAAD, University of Porto, Porto, Portugal and Faculty of Sciences, University of Porto, Porto, Portugal and Faculty of Medicine, University of Porto, Porto, Portugal;LIAAD, University of Porto, Porto, Portugal and CRACS - INESC, Porto, Portugal

  • Venue:
  • Intelligent Data Analysis - Ubiquitous Knowledge Discovery
  • Year:
  • 2011

Abstract

Nowadays, applications produce infinite streams of data distributed across wide sensor networks. In this work, we study the problem of continuously maintaining a cluster structure over the data points generated by the entire network. Usual techniques operate by forwarding and concentrating all the data in a central server, processing it as a multivariate stream. In this paper, we propose DGClust, a new distributed algorithm that reduces both the dimensionality and the communication burdens by allowing each local sensor to keep an online discretization of its data stream, which operates with constant update time and (almost) fixed space. Each new data point triggers a cell in this univariate grid, reflecting the current state of the data stream at the local site. Whenever a local site changes its state, it notifies the central server of its new state. This way, at each point in time, the central site has the global multivariate state of the entire network. To avoid monitoring all possible states, whose number is exponential in the number of sensors, the central site keeps a small list of counters of the most frequent global states. Finally, a simple adaptive partitional clustering algorithm is applied to the central points of the frequent states in order to provide an anytime definition of the cluster centers. The approach is evaluated in the context of distributed sensor networks, focusing on three outcomes: loss to real centroids, communication prevention, and processing reduction. The experimental work on synthetic data supports our proposal, showing robustness to a high number of sensors, and the application to real data from physiological sensors exposes the aforementioned advantages of the system.
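
The data flow described in the abstract (local discretization, notification only on state change, and a bounded list of frequent global states at the central site) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names `LocalSensor` and `CentralServer`, the fixed equal-width grid with a known value range, the one-count-per-time-step policy, and the Space-Saving-style counter replacement are all assumptions made for illustration, and the final clustering step is omitted.

```python
class LocalSensor:
    """Hypothetical local site: equal-width grid over an assumed known range."""

    def __init__(self, low, high, n_cells):
        self.low = low
        self.width = (high - low) / n_cells
        self.n_cells = n_cells
        self.state = None  # index of the most recently triggered cell

    def update(self, x):
        """Process one reading; return the new cell if the state changed, else None."""
        cell = int((x - self.low) / self.width)
        cell = min(max(cell, 0), self.n_cells - 1)  # clamp out-of-range readings
        if cell != self.state:
            self.state = cell
            return cell
        return None

    def center(self, cell):
        """Central point of a grid cell."""
        return self.low + (cell + 0.5) * self.width


class CentralServer:
    """Hypothetical central site: tracks the global state and frequent states."""

    def __init__(self, n_sensors, max_states):
        self.current = [None] * n_sensors
        self.counts = {}  # global state (tuple of cells) -> counter
        self.max_states = max_states
        self.messages = 0  # number of notifications received

    def notify(self, sensor_id, cell):
        """Called by a sensor only when its local state changes."""
        self.current[sensor_id] = cell
        self.messages += 1

    def tick(self):
        """Count the current global state once per time step (assumed policy)."""
        if None in self.current:
            return  # not every sensor has reported yet
        state = tuple(self.current)
        if state in self.counts:
            self.counts[state] += 1
        elif len(self.counts) < self.max_states:
            self.counts[state] = 1
        else:
            # Assumed Space-Saving-style replacement: evict the least
            # frequent state and let the new one inherit its count plus one.
            victim = min(self.counts, key=self.counts.get)
            self.counts[state] = self.counts.pop(victim) + 1


# Usage: two sensors, four time steps of made-up readings.
sensors = [LocalSensor(0.0, 10.0, 5), LocalSensor(0.0, 10.0, 5)]
server = CentralServer(n_sensors=2, max_states=3)
readings = [(1.0, 9.0), (1.5, 9.5), (1.2, 8.1), (5.0, 5.0)]
for step in readings:
    for i, x in enumerate(step):
        cell = sensors[i].update(x)
        if cell is not None:
            server.notify(i, cell)
    server.tick()

print("messages sent:", server.messages, "of", 2 * len(readings), "readings")
for state, count in server.counts.items():
    point = [sensors[i].center(c) for i, c in enumerate(state)]
    print(state, count, point)
```

In this sketch the central points printed at the end are what a partitional clustering step would take as input; the adaptive clustering itself and the adaptive grid maintenance are left out, since the abstract does not specify them.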