Clustering High-Dimensional Data with Low-Order Neighbors

Authors:
Yanchang Zhao;Chengqi Zhang;Yi-Dong Shen
Affiliations:
Univ. of Technology, Sydney, Australia;Univ. of Technology, Sydney, Australia;Chinese Academy of Sciences, China
Venue:
WI '04 Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence
Year:
2004


Abstract

Density-based and grid-based clustering are two main clustering approaches. The former is known for its ability to discover clusters of various shapes and to eliminate noise, while the latter is well known for its high speed. Combining the two approaches seems to provide better clustering results. To the best of our knowledge, however, all existing algorithms that combine density-based and grid-based clustering take cells as atomic units, in the sense that either all objects in a cell belong to a cluster or no object in the cell belongs to any cluster. This requires the cells to be small enough to ensure fine resolution of the results. In high-dimensional spaces, however, the number of cells can be very large when cells are small, which makes the clustering process extremely costly. Moreover, the number of neighbors of a cell grows exponentially with the dimensionality of the dataset, which increases the complexity further. In this paper, we present a new approach that takes objects (or points) as the atomic units, so that the restriction on cell size can be relaxed without degrading the resolution of the clustering results. In addition, the concept of ith-order neighbors is introduced to avoid considering the exponential number of neighboring cells. By considering only low-order neighbors, our algorithm is very efficient while losing only a small amount of accuracy. Experiments on synthetic and public data show that our algorithm can cluster high-dimensional data effectively and efficiently.
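
The abstract does not define ith-order neighbors, so the following is only an illustrative sketch of why restricting attention to low orders helps. It assumes, purely for illustration, that an ith-order neighbor of a grid cell is a cell whose integer index differs by exactly one step in exactly i coordinates and matches in all others. The function names `neighbors_of_order` and `count_neighbors` are hypothetical and are not taken from the paper.

```python
from itertools import combinations, product
from math import comb


def neighbors_of_order(cell, order):
    """Yield cells differing from `cell` by +/-1 in exactly `order` coordinates.

    ASSUMPTION: this definition of an ith-order neighbor is chosen for
    illustration; the abstract itself does not spell it out.
    """
    d = len(cell)
    for dims in combinations(range(d), order):
        for signs in product((-1, 1), repeat=order):
            neighbor = list(cell)
            for dim, sign in zip(dims, signs):
                neighbor[dim] += sign
            yield tuple(neighbor)


def count_neighbors(d, max_order):
    """Count neighbors of orders 1..max_order in a d-dimensional grid.

    Under the assumed definition there are C(d, i) * 2**i neighbors of
    order i, and summing over all orders gives 3**d - 1.
    """
    return sum(comb(d, i) * 2**i for i in range(1, max_order + 1))


if __name__ == "__main__":
    d = 20
    print("all neighbors:      ", count_neighbors(d, d))
    print("orders 1 and 2 only:", count_neighbors(d, 2))
    print("order 1 only:       ", count_neighbors(d, 1))
```

Under this assumed definition, a cell in a 20-dimensional grid has 3^20 - 1 = 3,486,784,400 neighbors in total, but only 40 first-order neighbors and 800 neighbors of order one or two, which is the kind of reduction the abstract attributes to considering only low-order neighbors.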