Memory efficient subspace clustering for online data streams

Authors:
Nam Hun Park; Won Suk Lee
Affiliations:
Yonsei University, Seoul, Korea; Yonsei University, Seoul, Korea
Venue:
IDEAS '08 Proceedings of the 2008 international symposium on Database engineering & applications
Year:
2008


Abstract

Subspace clustering over an online multi-dimensional data stream requires examining all subsets of its dimensions, so a huge amount of memory space may be required. To trace the ongoing changes of cluster patterns over an online data stream within a confined memory space, this paper proposes a grid-based subspace clustering algorithm that uses the confined memory space effectively. Given an n-dimensional data stream, the ongoing distribution statistics of data elements in each one-dimensional data space are first monitored by a list of grid-cells called a sibling list. Once a grid-cell of a first-level sibling list becomes a dense unit grid-cell, new second-level sibling lists are created as its child nodes in order to trace any cluster in all possible two-dimensional rectangular subspaces. In this way, a sibling tree grows up to the nth level at most, and a k-dimensional subcluster can be found at the kth level of the sibling tree. To utilize the confined space of main memory effectively, only the upper part of a sibling tree is kept expanded at all times, and the subtrees in the lower part are expanded in turn by various scheduling policies such as round-robin and priority-based. Furthermore, in order to confine the usage of memory space, the size of a unit grid-cell is adaptively minimized such that the result of clustering becomes as accurate as possible at all times. The performance of the proposed method is comparatively analyzed through a number of experiments to identify its various characteristics.
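
The sibling-tree idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class names (`Cell`, `SiblingList`, `SiblingTree`), the fixed cell `width`, and the absolute `min_count` density threshold are all hypothetical. The sketch also makes simplifying assumptions the abstract does not state: child sibling lists are created only for higher-numbered dimensions so that each subspace is enumerated once, child counts start only when the parent cell becomes dense, and there is no memory scheduling or adaptive cell sizing.

```python
from collections import defaultdict


class Cell:
    """One grid-cell: a count plus child sibling lists once it is dense."""

    def __init__(self):
        self.count = 0
        self.children = None  # list of SiblingList, created when dense


class SiblingList:
    """Grid-cells partitioning a single dimension (hypothetical layout)."""

    def __init__(self, dim):
        self.dim = dim
        self.cells = defaultdict(Cell)  # interval index -> Cell


class SiblingTree:
    def __init__(self, n_dims, width=1.0, min_count=3):
        self.n_dims = n_dims
        self.width = width          # assumed fixed; the paper adapts it
        self.min_count = min_count  # assumed absolute density threshold
        # First level: one sibling list per one-dimensional data space.
        self.roots = [SiblingList(d) for d in range(n_dims)]

    def insert(self, point):
        for sibling_list in self.roots:
            self._update(sibling_list, point)

    def _update(self, sibling_list, point):
        idx = int(point[sibling_list.dim] // self.width)
        cell = sibling_list.cells[idx]
        cell.count += 1
        if cell.count >= self.min_count and cell.children is None:
            # Assumption: expand only into higher-numbered dimensions
            # so the same subspace is not traced twice.
            cell.children = [
                SiblingList(d)
                for d in range(sibling_list.dim + 1, self.n_dims)
            ]
        for child in cell.children or []:
            self._update(child, point)

    def dense_units(self):
        """Return dense units as tuples of (dimension, interval index)."""
        found = []

        def walk(sibling_list, prefix):
            for idx, cell in sibling_list.cells.items():
                if cell.count >= self.min_count:
                    unit = prefix + ((sibling_list.dim, idx),)
                    found.append(unit)
                    for child in cell.children or []:
                        walk(child, unit)

        for sibling_list in self.roots:
            walk(sibling_list, ())
        return found
```

In this sketch a dense unit made of k (dimension, interval) pairs sits at the kth level of the tree, mirroring the abstract's statement that a k-dimensional subcluster is found at the kth level. Because a child list only sees points that arrive after its parent becomes dense, its counts are lower bounds on the true counts.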