Clustering High Dimensional Massive Scientific Datasets

Authors:
Ekow J. Otoo; Arie Shoshani; Seung-won Hwang
Venue:
SSDBM '01 Proceedings of the 13th International Conference on Scientific and Statistical Database Management
Year:
2001

Abstract

Many scientific applications can benefit from efficient algorithms for clustering massively large, high-dimensional datasets. However, most existing algorithms are impractical when the amount of data is very large. Given N objects, each defined by an M-dimensional feature vector, a clustering technique for very large datasets in high-dimensional space must, to be practical, run in O(N) time at best and O(N log N) time in the worst case, using no more than O(NM) storage. A parallelized version of the same algorithm should achieve linear speed-up in processing time as the number of processors increases. We introduce a hybrid algorithm called HyCeltyc as an approach for clustering massively large, high-dimensional datasets. HyCeltyc, which stands for Hybrid Cell Density Clustering method, combines a cell-density-based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm involve sampling, dimensionality reduction, and selection of significant features on which to cluster the data.
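
The general idea of pairing a cell-density pass with agglomerative merging can be illustrated with a minimal Python sketch. This is not the HyCeltyc algorithm itself, whose details the abstract does not give. It assumes a uniform grid, a fixed density threshold, and merging of adjacent dense cells by union-find. The function name `cell_density_clusters` and the parameters `cell_width` and `min_count` are hypothetical.

```python
from collections import Counter
from itertools import product


def cell_density_clusters(points, cell_width=1.0, min_count=2):
    """Group points by merging adjacent dense grid cells (illustrative only)."""
    # Step 1: a single pass over the data, O(N*M): map each point to a grid cell.
    cell_of = [tuple(int(x // cell_width) for x in p) for p in points]
    counts = Counter(cell_of)

    # Step 2: keep only the cells that hold at least min_count points.
    dense = {c for c, n in counts.items() if n >= min_count}

    # Step 3: agglomerate dense cells that touch each other, using union-find.
    parent = {c: c for c in dense}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    dims = len(points[0]) if points else 0
    # All neighbor offsets in {-1, 0, 1}^dims except the zero offset.
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]
    for c in dense:
        for o in offsets:
            nb = tuple(a + b for a, b in zip(c, o))
            if nb in dense:
                parent[find(c)] = find(nb)

    # Step 4: label each point by its cell's root; sparse cells are noise (-1).
    root_ids = {}
    labels = []
    for c in cell_of:
        if c not in dense:
            labels.append(-1)
        else:
            labels.append(root_ids.setdefault(find(c), len(root_ids)))
    return labels


points = [(0.1, 0.2), (0.3, 0.4), (1.2, 0.1), (1.5, 0.6),
          (5.1, 5.2), (5.3, 5.4), (9.0, 0.0)]
labels = cell_density_clusters(points)
```

In this sketch the counting pass is linear in N, but the neighbor scan visits 3^M - 1 offsets per dense cell. A grid scheme of this kind is therefore only workable when clustering on a small number of dimensions, which is consistent with the abstract's emphasis on dimensionality reduction and selection of significant features.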