Clustering distributed sensor data streams using local processing and reduced communication

Authors:
João Gama;Pedro Pereira Rodrigues;Luís Lopes
Affiliations:
(Corresponding e-mail: jgama@fep.up.pt) LIAAD, University of Porto, Porto, Portugal and Faculty of Economics, University of Porto, Porto, Portugal; LIAAD, University of Porto, Porto, Portugal and Faculty of Sciences, University of Porto, Porto, Portugal and Faculty of Medicine, University of Porto, Porto, Portugal; LIAAD, University of Porto, Porto, Portugal and CRACS - INESC, Porto, Portugal
Venue:
Intelligent Data Analysis - Ubiquitous Knowledge Discovery
Year:
2011



Abstract

Many applications now produce unbounded streams of data distributed across wide sensor networks. In this work we study the problem of continuously maintaining a cluster structure over the data points generated by the entire network. Usual techniques operate by forwarding and concentrating all the data at a central server, processing it as a multivariate stream. In this paper, we propose DGClust, a new distributed algorithm which reduces both the dimensionality and the communication burdens by allowing each local sensor to keep an online discretization of its data stream, which operates with constant update time and (almost) fixed space. Each new data point triggers a cell in this univariate grid, reflecting the current state of the data stream at the local site. Whenever a local site changes its state, it notifies the central server of its new state. This way, at each point in time, the central site holds the global multivariate state of the entire network. To avoid monitoring all possible states, whose number is exponential in the number of sensors, the central site keeps a small list of counters of the most frequent global states. Finally, a simple adaptive partitional clustering algorithm is applied to the central points of the frequent states in order to provide an anytime definition of the cluster centers. The approach is evaluated in the context of distributed sensor networks, focusing on three outcomes: loss to real centroids, communication prevention, and processing reduction. The experimental work on synthetic data supports our proposal, showing robustness to a high number of sensors, and the application to real data from physiological sensors demonstrates the aforementioned advantages of the system.
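
The first two stages described in the abstract (local discretization with state-change notifications, and a bounded list of counters for frequent global states) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `LocalSite`, `CentralSite`, and `run` are hypothetical. The fixed equal-width grid over a known range is a simplifying assumption standing in for the online discretization. The eviction rule is a Space-Saving-style counter update, chosen here as one plausible way to keep a small list of counters. Counting the global state once per time step is also an assumption.

```python
from typing import Dict, List, Optional, Sequence, Tuple


class LocalSite:
    """Hypothetical sensor-side discretizer using a fixed equal-width grid.

    Assumption: the value range [low, high] is known in advance.
    """

    def __init__(self, low: float, high: float, n_cells: int) -> None:
        self.low, self.high, self.n_cells = low, high, n_cells
        self.state: Optional[int] = None  # current grid cell, None before any data

    def cell_of(self, x: float) -> int:
        width = (self.high - self.low) / self.n_cells
        idx = int((x - self.low) // width)
        return min(max(idx, 0), self.n_cells - 1)  # clamp out-of-range values

    def update(self, x: float) -> Optional[int]:
        """Return the new cell only if the local state changed, else None."""
        cell = self.cell_of(x)
        if cell == self.state:
            return None
        self.state = cell
        return cell


class CentralSite:
    """Hypothetical central site tracking frequent global states."""

    def __init__(self, n_sensors: int, capacity: int) -> None:
        self.state: List[Optional[int]] = [None] * n_sensors
        self.capacity = capacity  # maximum number of monitored global states
        self.counts: Dict[Tuple[int, ...], int] = {}

    def notify(self, sensor_id: int, cell: int) -> None:
        """Record a state change reported by one sensor."""
        self.state[sensor_id] = cell

    def tick(self) -> None:
        """Count the current global state (assumed: once per time step)."""
        if any(c is None for c in self.state):
            return  # not every sensor has reported yet
        key = tuple(self.state)
        if key in self.counts:
            self.counts[key] += 1
        elif len(self.counts) < self.capacity:
            self.counts[key] = 1
        else:
            # Space-Saving-style eviction: replace the least frequent state
            # and inherit its count plus one.
            victim = min(self.counts, key=self.counts.get)
            self.counts[key] = self.counts.pop(victim) + 1

    def top_states(self, m: int) -> List[Tuple[Tuple[int, ...], int]]:
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:m]


def run(readings: Sequence[Sequence[float]],
        sites: List[LocalSite], central: CentralSite) -> int:
    """Feed one row of readings per time step; return the messages sent."""
    messages = 0
    for row in readings:
        for i, (site, x) in enumerate(zip(sites, row)):
            cell = site.update(x)
            if cell is not None:  # communicate only on a state change
                central.notify(i, cell)
                messages += 1
        central.tick()
    return messages
```

In this sketch a sensor whose readings stay inside one grid cell sends nothing, so the message count depends on how often states change and not on how many points are observed. The frequent states returned by `top_states` would then be the input to the final clustering step, which is not sketched here.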