A Scalable Parallel Subspace Clustering Algorithm for Massive Data Sets

Authors:
Harsha S. Nagesh; Alok Choudhary; Sanjay Goil
Venue:
ICPP '00 Proceedings of the 2000 International Conference on Parallel Processing
Year:
2000

Abstract

Clustering is a data-mining problem that seeks dense regions in a sparse multi-dimensional data set. The attribute values and ranges of these regions characterize the clusters. Clustering algorithms need to scale with both the database size and the dimensionality of the data set. They also need to find clusters embedded in subspaces of a high-dimensional space. However, the time complexity of exploring clusters in subspaces is exponential in the dimensionality of the data, which makes the search extremely compute-intensive. Parallelization is therefore the natural choice for discovering clusters in large data sets. In this paper, we present a scalable parallel subspace-clustering algorithm that embeds both data and task parallelism. We also formulate the technique of adaptive grids and present a truly unsupervised clustering algorithm that requires no user inputs. Our implementation shows near-linear speedups with negligible communication overheads. The use of adaptive grids improves the computation time of our serial algorithm by two orders of magnitude over current methods and yields much better clustering quality. Performance results on both real and synthetic data sets with a very large number of dimensions, obtained on a 16-node IBM SP2, demonstrate that our algorithm is a practical and scalable clustering technique.
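To make the idea of dense regions in subspaces concrete, the following is a minimal serial sketch of a generic bottom-up, grid-based subspace search. It is not the paper's algorithm: the abstract does not specify how the adaptive grids are built or how the work is parallelized, so this sketch assumes a fixed-width uniform grid and a user-supplied density threshold (both of which the paper's method is said to avoid). The names `dense_units`, `dense_subspaces`, `bins`, and `min_count` are hypothetical. A d-dimensional data set has 2^d - 1 non-empty subspaces, so the sketch only examines a k-dimensional subspace when all of its (k-1)-dimensional projections already contain a dense cell.

```python
from collections import Counter
from itertools import combinations


def dense_units(points, dims, bins, lo, hi, min_count):
    """Return the grid cells of subspace `dims` that hold >= min_count points.

    Assumption: a fixed-width uniform grid over [lo, hi] in every dimension.
    """
    width = (hi - lo) / bins
    counts = Counter()
    for p in points:
        # Map the point to its cell index in each selected dimension,
        # clamping values equal to `hi` into the last bin.
        cell = tuple(min(int((p[d] - lo) / width), bins - 1) for d in dims)
        counts[cell] += 1
    return {cell for cell, n in counts.items() if n >= min_count}


def dense_subspaces(points, bins=4, lo=0.0, hi=1.0, min_count=3):
    """Bottom-up search: map each subspace (tuple of dims) to its dense cells."""
    n_dims = len(points[0])
    result = {}
    level = [(d,) for d in range(n_dims)]  # start with 1-D subspaces
    while level:
        found = {}
        for dims in level:
            cells = dense_units(points, dims, bins, lo, hi, min_count)
            if cells:
                found[dims] = cells
        result.update(found)
        # Build (k+1)-D candidates whose k-D projections were all dense.
        k = len(level[0]) + 1
        dense_dims = sorted({d for dims in found for d in dims})
        level = [c for c in combinations(dense_dims, k)
                 if all(s in found for s in combinations(c, k - 1))]
    return result


# Toy data: points cluster in dimensions 0 and 1 but spread out in dimension 2.
points = [
    (0.10, 0.10, 0.05), (0.12, 0.15, 0.30), (0.15, 0.12, 0.55),
    (0.20, 0.18, 0.80), (0.90, 0.60, 0.10), (0.60, 0.90, 0.95),
]
result = dense_subspaces(points)
print(sorted(result))  # → [(0,), (0, 1), (1,)]
```

In this toy run the cluster is found in the subspace spanned by dimensions 0 and 1, while dimension 2 is pruned at the first level and no subspace containing it is ever examined. The pruning step is what keeps such a search tractable in practice, although the worst case remains exponential in the number of dimensions.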