A Framework for Clustering Massive-Domain Data Streams

Authors:
Charu Aggarwal
Affiliations:
-
Venue:
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Year:
2009

Citing 0
Cited 2

XStreamCluster: an efficient algorithm for streaming XML data clustering

DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications - Volume Part I
Content-based crowd retrieval on the real-time web

Proceedings of the 21st ACM international conference on Information and knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we will examine the problem of clustering massive domain data streams. Massive-domain data streams are those in which the number of possible domain values for each attribute are very large and cannot be easily tracked for clustering purposes. Some examples of such streams include IP-address streams, credit-card transaction streams, or streams of sales data over large numbers of items. In such cases, it is well known that even simple stream operations such as counting can be extremely difficult because ofthe difficulty in maintaining summary information over the different discrete values. The task of clustering is significantly more challenging in such cases, since the intermediate statistics for the different clusters cannot be maintained efficiently. In this paper, we propose a method for clustering massive-domain data streams with the use of sketches. We prove probabilistic results which show that a sketch-based clustering method can provide similar results to an infinite-space clustering algorithm with high probability. We present experimental results which validate these theoretical results, and show that it is possible to approximate the behavior of an infinite-space algorithm accurately.