O-Cluster: Scalable Clustering of Large High Dimensional Data Sets

  • Authors:
  • Boriana L. Milenova; Marcos M. Campos

  • Venue:
  • ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
  • Year:
  • 2002

Abstract

Clustering large data sets of high dimensionality has always been a challenge for clustering algorithms. Many recently developed clustering algorithms have attempted to address either handling data sets with a very large number of records and/or with a very high number of dimensions. This paper provides a discussion of the advantages and limitations of existing algorithms when they operate on very large multidimensional data sets. To simultaneously overcome both the "curse of dimensionality" and the scalability problems associated with large amounts of data, we propose a new clustering algorithm called O-Cluster. O-Cluster combines a novel active sampling technique with an axis-parallel partitioning strategy to identify continuous areas of high density in the input space. The method operates on a limited memory buffer and requires at most a single scan through the data. We demonstrate the high quality of the obtained clustering solutions, their robustness to noise, and O-Cluster's excellent scalability.
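
The abstract names axis-parallel partitioning as the way dense regions are separated, but gives no procedure. The sketch below is only an illustration of the general idea and is not the authors' algorithm: it projects the points onto one axis at a time, looks for a sparse histogram bin lying between two denser regions, and cuts there. The function names (`histogram`, `find_valley_split`, `partition`), the `ratio` threshold, and the simple count-ratio valley test are all assumptions made for this example. The active sampling and limited-memory buffering described in the abstract are not modeled.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def histogram(values: Sequence[float], bins: int) -> Tuple[List[int], float, float]:
    """Equal-width histogram; returns (counts, lower bound, bin width)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
    return counts, lo, width


def find_valley_split(values: Sequence[float], bins: int = 10,
                      ratio: float = 0.5) -> Optional[float]:
    """Return a cut point at the sparsest bin flanked by denser bins, or None.

    Assumption: a bin counts as a valley if it holds fewer than `ratio`
    times the smaller of the peak counts on either side of it.
    """
    counts, lo, width = histogram(values, bins)
    best: Optional[Tuple[int, int]] = None  # (count in valley bin, bin index)
    for i in range(1, bins - 1):
        left_peak = max(counts[:i])
        right_peak = max(counts[i + 1:])
        if counts[i] < ratio * min(left_peak, right_peak):
            if best is None or counts[i] < best[0]:
                best = (counts[i], i)
    if best is None:
        return None
    return lo + (best[1] + 0.5) * width  # cut through the middle of the valley bin


def partition(points: List[Point], bins: int = 10, ratio: float = 0.5,
              min_size: int = 4) -> List[List[Point]]:
    """Recursively split points with axis-parallel cuts until no valley is found."""
    if len(points) >= min_size:
        for dim in range(len(points[0])):
            cut = find_valley_split([p[dim] for p in points], bins, ratio)
            if cut is not None:
                left = [p for p in points if p[dim] < cut]
                right = [p for p in points if p[dim] >= cut]
                return (partition(left, bins, ratio, min_size)
                        + partition(right, bins, ratio, min_size))
    return [points]  # no acceptable cut: treat the region as one cluster
```

For example, `partition` applied to two well-separated groups of points along one axis returns two lists, one per group. A more careful version would replace the count-ratio test with a statistical significance test on the histogram counts.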