Memory efficient subspace clustering for online data streams

  • Authors:
  • Nam Hun Park;Won Suk Lee

  • Affiliations:
  • Yonsei University, Seoul, Korea;Yonsei University, Seoul, Korea

  • Venue:
  • IDEAS '08 Proceedings of the 2008 international symposium on Database engineering & applications
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

Subspace clustering over an online multi-dimensional data stream requires to examine all the subsets of its dimensions, so that a huge amount of memory space may be required. To trace the ongoing changes of cluster patterns over an online data stream by a confined memory space, this paper proposes a grid-based subspace clustering algorithm that can utilize the confined memory space effectively. Given an n-dimensional data stream, the on-going distribution statistics of data elements in each one-dimension data space are firstly monitored by a list of grid-cells called a sibling list. Once a grid-cell of a first-level sibling list becomes a dense unit grid-cell, new second-level sibling lists are created as its child nodes in order to trace any cluster in all possible two-dimensional rectangular subspaces. In such a way, a sibling tree grows up to the nth level at most and a k-dimensional subcluster can be found at the kth level of the sibling tree. To utilize the confined space of main memory effectively, only the upper-part of a sibling tree is expanded at all times and the subtrees in the lower part are expanded in turns by various scheduling policies such as round-robin and priority-based. Furthermore, in order to confine the usage of memory space, the size of a unit grid-cell is adaptively minimized such that the result of clustering becomes as accurate as possible at all times. The performance of the proposed method is comparatively analyzed by a number of experiments to identify its various characteristics.