O-Cluster: Scalable Clustering of Large High Dimensional Data Sets

Authors:
Boriana L. Milenova;Marcos M. Campos
Venue:
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Year:
2002

Cited by:
Subspace clustering for high dimensional data: a review
ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets

NOCEA: A rule-based evolutionary algorithm for efficient and effective clustering of massive high-dimensional databases
Applied Soft Computing

Discovering personally meaningful places: An interactive clustering approach
ACM Transactions on Information Systems (TOIS)

Towards a digital archive for handwritten paper slips with ethnological contents
ICADL'07 Proceedings of the 10th international conference on Asian digital libraries: looking back 10 years and forging new frontiers

Distributed data mining methodology for clustering and classification model
ICAISC'10 Proceedings of the 10th international conference on Artificial intelligence and soft computing: Part I

Survey: Graph clustering
Computer Science Review

Clustering high dimensional data
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery

Subspace clustering of high-dimensional data: an evolutionary approach
Applied Computational Intelligence and Soft Computing

Abstract

Clustering large data sets of high dimensionality has always been a challenge for clustering algorithms. Many recently developed clustering algorithms have attempted to address either handling data sets with a very large number of records and/or with a very high number of dimensions. This paper provides a discussion of the advantages and limitations of existing algorithms when they operate on very large multidimensional data sets. To simultaneously overcome both the "curse of dimensionality" and the scalability problems associated with large amounts of data, we propose a new clustering algorithm called O-Cluster. O-Cluster combines a novel active sampling technique with an axis-parallel partitioning strategy to identify continuous areas of high density in the input space. The method operates on a limited memory buffer and requires at most a single scan through the data. We demonstrate the high quality of the obtained clustering solutions, their robustness to noise, and O-Cluster's excellent scalability.
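
To make the phrase "axis-parallel partitioning" concrete, the following is a minimal illustrative sketch and not the O-Cluster algorithm itself. It assumes a small in-memory buffer of points, builds an equal-width histogram along each dimension, and picks the cut that falls in the deepest low-density valley between two denser regions. The function names (`histogram`, `best_valley`, `split_axis_parallel`), the bin count, and the valley-depth criterion are all hypothetical choices made for illustration; the active sampling step and any statistical validation of a split described in the paper are omitted.

```python
def histogram(values, lo, hi, bins):
    """Count how many values fall into each of `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
    return counts


def best_valley(counts):
    """Return (bin index, depth) of the deepest bin flanked by higher bins, or None."""
    best = None
    for i in range(1, len(counts) - 1):
        left_peak = max(counts[:i])
        right_peak = max(counts[i + 1:])
        depth = min(left_peak, right_peak) - counts[i]
        if depth > 0 and (best is None or depth > best[1]):
            best = (i, depth)
    return best


def split_axis_parallel(points, bins=10):
    """Pick the (dimension, cut value, depth) with the deepest valley, or None.

    Assumption: a simple depth score stands in for whatever criterion
    the paper actually uses to accept or reject a split.
    """
    choice = None
    for d in range(len(points[0])):
        column = [p[d] for p in points]
        lo, hi = min(column), max(column)
        if hi == lo:
            continue  # constant dimension, nothing to split
        valley = best_valley(histogram(column, lo, hi, bins))
        if valley and (choice is None or valley[1] > choice[2]):
            i, depth = valley
            cut = lo + (i + 0.5) * (hi - lo) / bins  # centre of the valley bin
            choice = (d, cut, depth)
    return choice


# Hypothetical buffer: two dense groups separated along dimension 0.
buffer_points = (
    [(0.1 * k, float(k % 5)) for k in range(20)]
    + [(10.0 + 0.1 * k, float(k % 5)) for k in range(20)]
)
split = split_axis_parallel(buffer_points)
print(split)
```

Applying such a split recursively to each resulting partition yields axis-aligned regions of high density; a buffer with no valley in any dimension returns `None` and would be left unsplit in this sketch.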