Many scientific applications can benefit from an efficient algorithm for clustering massively large, high-dimensional datasets. However, most existing algorithms are impractical when the amount of data is very large. Given N objects, each defined by an M-dimensional feature vector, a clustering technique for very large datasets in high-dimensional space is practical only if it runs in O(MN) time at best and O(MN log N) time in the worst case, using no more than O(MN) storage. We introduce a hybrid algorithm, called HyCeltyc, that clusters massively large high-dimensional datasets in O(MN) time, which is linear in the size of the data. HyCeltyc, which stands for Hybrid Cell Density Clustering method, combines a cell-density-based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm are sampling, dimensionality reduction, selection of the significant features on which to cluster the data, and a grid-based clustering algorithm that is linear in the data size.
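The grid-based step can be illustrated with a minimal sketch. This is not the HyCeltyc implementation, whose details the abstract does not give. It is a generic cell-density scheme under stated assumptions: the function name `grid_density_cluster` and the parameters `cell_size` and `min_count` are hypothetical, sampling, dimensionality reduction, and feature selection are omitted, and a simple flood fill over adjacent dense cells stands in for the hierarchical agglomerative merging. The binning pass visits each of the N points once and touches each of its M coordinates once, which is where the O(MN) cost of a grid approach comes from.

```python
from collections import defaultdict, deque


def grid_density_cluster(points, cell_size, min_count):
    """Label points by connected groups of dense grid cells; -1 marks noise.

    Hypothetical helper for illustration only, not the authors' method.
    """
    # Single pass over the data: map each point to its integer cell coordinates.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(int(x // cell_size) for x in p)
        cells[key].append(idx)

    # A cell is "dense" if it holds at least min_count points.
    dense = {key for key, members in cells.items() if len(members) >= min_count}

    labels = [-1] * len(points)
    seen = set()
    cluster_id = 0
    for start in dense:
        if start in seen:
            continue
        # Flood fill over face-adjacent dense cells (2 * M neighbours per cell).
        seen.add(start)
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for idx in cells[cell]:
                labels[idx] = cluster_id
            for dim in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        cluster_id += 1
    return labels
```

Points that fall in sparse cells keep the label -1. Only face-adjacent cells are merged here, so the number of neighbours checked per cell grows linearly with M rather than exponentially, which is one plausible way to keep a grid method usable in high dimensions.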