Cell trees: An adaptive synopsis structure for clustering multi-dimensional on-line data streams

Authors:
Nam Hun Park;Won Suk Lee
Affiliations:
Department of Computer Science, Yonsei University, 134 Shinchondong Seodaemungu, Seoul 120-749, Republic of Korea;Department of Computer Science, Yonsei University, 134 Shinchondong Seodaemungu, Seoul 120-749, Republic of Korea
Venue:
Data & Knowledge Engineering
Year:
2007

Citing 13
Cited 7

BIRCH: an efficient data clustering method for very large databases

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Entropy-based subspace clustering for mining numerical data

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Sliding-window filtering: an efficient algorithm for incremental mining

Proceedings of the tenth international conference on Information and knowledge management
The Art of Computer Programming Volumes 1-3 Boxed Set

The Art of Computer Programming Volumes 1-3 Boxed Set
Maintaining stream statistics over sliding windows: (extended abstract)

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Incremental Clustering for Mining in a Data Warehousing Environment

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Online Data Mining for Co-Evolving Time Sequences

ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Streaming-Data Algorithms for High-Quality Clustering

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Statistical grid-based clustering over data streams

ACM SIGMOD Record
Mining data streams: a review

ACM SIGMOD Record
On-line EM Algorithm for the Normalized Gaussian Network

Neural Computation
Approximate frequency counts over data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
A framework for clustering evolving data streams

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29

Grid-based subspace clustering over data streams

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Memory efficient subspace clustering for online data streams

IDEAS '08 Proceedings of the 2008 international symposium on Database engineering & applications
A dynamic data granulation through adjustable fuzzy clustering

Pattern Recognition Letters
A coarse-grain grid-based subspace clustering method for online multi-dimensional data streams

Proceedings of the 17th ACM conference on Information and knowledge management
Incremental clustering of dynamic data streams using connectivity based representative points

Data & Knowledge Engineering
Efficiently tracing clusters over high-dimensional on-line data streams

Data & Knowledge Engineering
Data stream clustering: A survey

ACM Computing Surveys (CSUR)

Quantified Score

Hi-index	0.02

Visualization

Abstract

To effectively trace the clusters of recently generated data elements in an on-line data stream, a sibling list and a cell tree are proposed in this paper. Initially, the multi-dimensional data space of a data stream is partitioned into mutually exclusive equal-sized grid-cells. Each grid-cell monitors the recent distribution statistics of data elements within its range. The old distribution statistics of each grid-cell are diminished by a predefined decay rate as time goes by, so that the effect of the obsolete information on the current result of clustering can be eliminated without maintaining any data element physically. Given a partitioning factor h, a dense grid-cell is partitioned into h equal-size smaller grid-cells. Such partitioning is continued until a grid-cell becomes the smallest one called a unit grid-cell. Conversely, a set of consecutive sparse grid-cells can be merged into a single grid-cell. A sibling list is a structure to manage the set of all grid-cells in a one-dimensional data space and it acts as an index for locating a specific grid-cell. Upon creating a dense unit grid-cell on a one-dimensional data space, a new sibling list for another dimension is created as a child of the grid-cell. In such a way, a cell tree is created. By repeating this process, a multi-dimensional dense unit grid-cell is identified by a path of a cell tree. Furthermore, in order to confine the usage of memory space, the size of a unit grid-cell is adaptively minimized such that the result of clustering becomes as accurate as possible at all times. The proposed method is comparatively analyzed by a series of experiments to identify its various characteristics.