Clustering High Dimensional Massive Scientific Datasets

Authors:
Ekow J. Otoo;Arie Shoshani;Seung-Won Hwang
Affiliations:
Lawrence Berkeley National Laboratory, 1 Cyclotron Road, University of California, Berkeley, CA 94720, USA. ejotoo@lbl.gov;Lawrence Berkeley National Laboratory, 1 Cyclotron Road, University of California, Berkeley, CA 94720, USA. shoshani@lbl.gov;Department of Computer Science, University of Illinois at Urbana-Champaign, 1304 W. Springfield Avenue, Urbana, IL 61801, USA
Venue:
Journal of Intelligent Information Systems
Year:
2001

Abstract

Many scientific applications can benefit from an efficient algorithm for clustering massively large, high-dimensional datasets. However, most existing algorithms are impractical when the amount of data is very large. Given N objects, each defined by an M-dimensional feature vector, a clustering technique for very large datasets in high-dimensional space should run in O(MN) time at best and O(MN log N) time in the worst case, using no more than O(MN) storage, for it to be practical. We introduce a hybrid algorithm, called HyCeltyc, for clustering massively large high-dimensional datasets in O(MN) time, which is linear in the size of the data. HyCeltyc, which stands for Hybrid Cell Density Clustering method, combines a cell-density based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm are sampling, dimensionality reduction, selection of significant features on which to cluster the data, and a grid-based clustering algorithm that is linear in the data size.
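
To illustrate the general idea of cell-density clustering on a grid, here is a minimal sketch in Python. It is not the HyCeltyc algorithm: the abstract does not give its details, so the function name `grid_density_clusters`, the parameters `cell_width` and `min_count`, and the choice to merge only face-adjacent dense cells are all assumptions made for illustration. The sketch hashes each point to a grid cell in one pass, keeps cells holding at least `min_count` points, and joins neighbouring dense cells with a union-find structure. For a fixed number of dimensions the work grows roughly linearly with the number of points.

```python
from collections import defaultdict


def grid_density_clusters(points, cell_width, min_count):
    """Label points by connected groups of dense grid cells (illustrative only).

    points     : sequence of equal-length numeric tuples
    cell_width : side length of each grid cell (hypothetical parameter)
    min_count  : minimum points for a cell to count as dense (hypothetical)
    Returns a list of cluster ids, with -1 for points in sparse cells.
    """
    # Pass 1: hash every point to its grid cell, O(M) work per point.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(int(x // cell_width) for x in p)
        cells[key].append(idx)

    # Keep only cells whose population reaches the density threshold.
    dense = {k for k, members in cells.items() if len(members) >= min_count}

    # Union-find over dense cells.
    parent = {k: k for k in dense}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    # Merge each dense cell with its dense neighbour one step up along each
    # axis; checking only the +1 side covers every face-adjacent pair once.
    for k in dense:
        for d in range(len(k)):
            neighbour = k[:d] + (k[d] + 1,) + k[d + 1:]
            if neighbour in dense:
                ra, rb = find(k), find(neighbour)
                if ra != rb:
                    parent[ra] = rb

    # Pass 2: give every point in a dense cell the id of its cell group.
    labels = [-1] * len(points)
    cluster_ids = {}
    for k in dense:
        cid = cluster_ids.setdefault(find(k), len(cluster_ids))
        for idx in cells[k]:
            labels[idx] = cid
    return labels


if __name__ == "__main__":
    pts = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (1.1, 0.2), (1.3, 0.4),
           (5.0, 5.0), (5.2, 5.1), (5.3, 5.4), (9.9, 0.1)]
    print(grid_density_clusters(pts, cell_width=1.0, min_count=2))
```

In the small example, the first five points fall in two adjacent dense cells and share one cluster id, the next three form a second cluster, and the isolated last point is labelled -1. The numeric ids themselves are arbitrary.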