Cell trees: An adaptive synopsis structure for clustering multi-dimensional on-line data streams

  • Authors: Nam Hun Park; Won Suk Lee

  • Affiliations: Department of Computer Science, Yonsei University, 134 Shinchondong Seodaemungu, Seoul 120-749, Republic of Korea (both authors)

  • Venue: Data & Knowledge Engineering
  • Year: 2007

Abstract

To effectively trace the clusters of recently generated data elements in an on-line data stream, this paper proposes a sibling list and a cell tree. Initially, the multi-dimensional data space of a data stream is partitioned into mutually exclusive, equal-sized grid-cells. Each grid-cell monitors the recent distribution statistics of the data elements within its range. The old distribution statistics of each grid-cell are diminished by a predefined decay rate as time goes by, so that the effect of obsolete information on the current clustering result can be eliminated without physically maintaining any data element. Given a partitioning factor h, a dense grid-cell is partitioned into h equal-sized smaller grid-cells. Such partitioning continues until a grid-cell becomes the smallest one, called a unit grid-cell. Conversely, a set of consecutive sparse grid-cells can be merged into a single grid-cell. A sibling list is a structure that manages the set of all grid-cells in a one-dimensional data space and acts as an index for locating a specific grid-cell. When a dense unit grid-cell is created in a one-dimensional data space, a new sibling list for another dimension is created as a child of that grid-cell. In this way, a cell tree is created. By repeating this process, a multi-dimensional dense unit grid-cell is identified by a path of a cell tree. Furthermore, in order to confine the usage of memory space, the size of a unit grid-cell is adaptively minimized such that the result of clustering becomes as accurate as possible at all times. The proposed method is comparatively analyzed by a series of experiments to identify its various characteristics.
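
The decay and partitioning steps described in the abstract can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation: the class and parameter names (`Cell`, `SiblingList`, `dense`, `unit`), the density test (a cell's decayed count as a fraction of the decayed total), and the even split of a parent's count among its h children are all assumptions made for illustration. Merging of sparse cells and the child sibling lists that form the cell tree are left out.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Cell:
    lo: float            # inclusive lower bound of the cell's range
    hi: float            # exclusive upper bound of the cell's range
    count: float = 0.0   # decayed number of elements, valid as of time `last`
    last: int = 0        # time step at which `count` was last updated


class SiblingList:
    """Hypothetical 1-D sibling list: cells kept sorted by lower bound."""

    def __init__(self, lo, hi, n_cells, h, decay, dense, unit):
        width = (hi - lo) / n_cells
        self.cells = [Cell(lo + i * width, lo + (i + 1) * width)
                      for i in range(n_cells)]
        self.h = h            # partitioning factor
        self.decay = decay    # per-step decay rate, 0 < decay <= 1
        self.dense = dense    # assumed density threshold (fraction of total)
        self.unit = unit      # width of a unit grid-cell (never split further)
        self.total = 0.0      # decayed count of all elements seen so far
        self.now = 0          # current time step

    def find(self, x):
        """Return the index of the cell whose range contains x."""
        if not (self.cells[0].lo <= x < self.cells[-1].hi):
            raise ValueError("x lies outside the data space")
        return bisect_right([c.lo for c in self.cells], x) - 1

    def insert(self, x):
        """Count one new element, decaying older statistics first."""
        i = self.find(x)
        self.now += 1
        self.total = self.total * self.decay + 1.0
        c = self.cells[i]
        # Lazy decay: age the stored count by the steps elapsed since `last`.
        c.count = c.count * self.decay ** (self.now - c.last) + 1.0
        c.last = self.now
        if c.count / self.total >= self.dense and (c.hi - c.lo) > self.unit:
            self._split(i)

    def _split(self, i):
        """Replace a dense cell by h equal-sized children."""
        c = self.cells[i]
        w = (c.hi - c.lo) / self.h
        edges = [c.lo + k * w for k in range(self.h)] + [c.hi]
        # Assumption: the parent's count is shared evenly among the children.
        self.cells[i:i + 1] = [Cell(edges[k], edges[k + 1], c.count / self.h, c.last)
                               for k in range(self.h)]


# Usage: a stream concentrated at 5.5 refines the region around that value.
sl = SiblingList(lo=0.0, hi=16.0, n_cells=4, h=2, decay=0.9, dense=0.3, unit=1.0)
for _ in range(50):
    sl.insert(5.5)
dense_cell = sl.cells[sl.find(5.5)]
print(dense_cell.lo, dense_cell.hi)  # prints 5.0 6.0
```

In this sketch only the region that keeps receiving elements is refined down to the unit width, while the rest of the data space stays at the coarse initial resolution, which is the memory-saving behaviour the abstract describes.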