Clustering high dimensional data streams with representative points

Authors:
Xiujun Wang;Hong Shen
Affiliations:
Department of Computer Science and Technology, University of Science and Technology of China, China;Department of Computer Science and Technology, University of Science and Technology of China, China and School of Computer Science, University of Adelaide, Australia
Venue:
FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 1
Year:
2009

Abstract

In this paper, we propose a novel algorithm for clustering high dimensional data streams with representative data points. The fixed-size interval partitioning adopted in traditional grid-based clustering methods cannot capture the clusters in each dimension well when applied to evolving high dimensional data streams, and it may generate unnecessary dense grids that misrepresent clusters in a subspace. To overcome these drawbacks, we quantify each dimension (attribute) of the data points separately and use the representative data points generated for each dimension in place of fixed-size intervals. These representative points are updated continuously as new data points arrive, so they capture the cluster trends in each dimension more accurately than fixed-size intervals can. Instead of discarding a historical data point as a whole, our algorithm confines data discarding to the attribute level, using the statistics stored in the representative data points. This lets us keep the useful parts of a data point and discard the trivial parts. Experimental results on synthetic and real data sets demonstrate the effectiveness and accuracy of the proposed method.
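
The abstract does not give the algorithm's details, so the following is only an illustrative sketch of the general idea: one set of representative points per dimension, updated as data arrives, with stale statistics discarded per attribute rather than per data point. Every name and parameter here (Rep, DimensionSummary, StreamSummary, radius, decay, min_weight) is hypothetical, and the exponential fading and nearest-representative update rule are assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Rep:
    """One representative point on a single dimension (hypothetical structure)."""
    center: float  # running mean of the values absorbed so far
    weight: float  # faded count of the values absorbed so far


class DimensionSummary:
    """Representative points for one attribute of the stream (assumed design)."""

    def __init__(self, radius, decay=0.99, min_weight=0.05):
        self.radius = radius          # assumed: max distance at which a value is absorbed
        self.decay = decay            # assumed: fading factor applied on every arrival
        self.min_weight = min_weight  # assumed: threshold below which a point is dropped
        self.reps = []

    def update(self, value):
        # Fade old statistics, then discard stale representatives.
        # Discarding happens here, per attribute, not per whole data point.
        for rep in self.reps:
            rep.weight *= self.decay
        self.reps = [r for r in self.reps if r.weight >= self.min_weight]

        # Absorb the value into the nearest representative, or start a new one.
        nearest = min(self.reps, key=lambda r: abs(r.center - value), default=None)
        if nearest is not None and abs(nearest.center - value) <= self.radius:
            nearest.weight += 1.0
            nearest.center += (value - nearest.center) / nearest.weight
        else:
            self.reps.append(Rep(center=value, weight=1.0))


class StreamSummary:
    """Keeps one independent DimensionSummary per attribute."""

    def __init__(self, n_dims, radius, decay=0.99, min_weight=0.05):
        self.dims = [DimensionSummary(radius, decay, min_weight) for _ in range(n_dims)]

    def update(self, point):
        for summary, value in zip(self.dims, point):
            summary.update(value)
```

In this sketch, a point such as (0.2, 50.0) can contribute to a long-lived representative on the first dimension while its outlying second attribute fades away on its own, which is the kind of attribute-level retention the abstract describes. How the per-dimension representatives are combined into subspace clusters is not covered here.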