pPOP: Fast yet accurate parallel hierarchical clustering using partitioning

Authors:
Manoranjan Dash;Simona Petrutiu;Peter Scheuermann
Affiliations:
School of Computer Engineering, Nanyang Technological University, Blk N4, #2c-85, Nanyang Avenue, Singapore 639798, Singapore;Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL 60208, United States;Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL 60208, United States
Venue:
Data & Knowledge Engineering
Year:
2007

Abstract

Hierarchical agglomerative clustering (HAC) is widely useful, but its high CPU time and memory complexity limit its practical use. We earlier proposed an efficient partitioning scheme, partially overlapping partitioning (POP), based on the observation that HAC agglomerates small, closely placed clusters first and larger, more distant clusters only towards the end. Here we present pPOP, the parallel version of POP. Theoretical analysis shows that, compared with existing algorithms, pPOP achieves a CPU-time speed-up and a memory scale-down of O(c) without compromising accuracy, where c is the number of cells in the partition. A shared-memory implementation shows that pPOP significantly outperforms existing algorithms.
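
The observation behind the partitioning can be illustrated with a minimal sketch. It is not the authors' implementation: the overlap width `delta`, the choice to cut along the first coordinate, the brute-force pair search, and the function names `closest_pair` and `overlapping_cells` are all assumptions made for illustration. The sketch shows only that when neighbouring cells overlap by `delta`, a pair of points closer than `delta` always shares at least one cell, so the early (short-distance) merges of HAC can be found by searching cells independently.

```python
import itertools
import math
from concurrent.futures import ThreadPoolExecutor


def closest_pair(points):
    """Brute-force closest pair; returns (distance, p, q) or (inf, None, None)."""
    best = (math.inf, None, None)
    for p, q in itertools.combinations(points, 2):
        d = math.dist(p, q)
        if d < best[0]:
            best = (d, p, q)
    return best


def overlapping_cells(points, c, delta):
    """Split points into c cells along the first coordinate (assumed layout).

    Cell i covers [lo + i*width, lo + (i+1)*width + delta], so neighbouring
    cells share a strip of width delta. The first and last cells are left
    open-ended so that no point falls outside the partition.
    """
    xs = [p[0] for p in points]
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / c
    cells = [[] for _ in range(c)]
    for p in points:
        for i in range(c):
            start = -math.inf if i == 0 else lo + i * width
            end = math.inf if i == c - 1 else lo + (i + 1) * width + delta
            if start <= p[0] <= end:
                cells[i].append(p)
    return cells


# Toy data (hypothetical): the two closest points straddle the boundary x = 1.
points = [(0.0, 0.0), (0.9, 0.0), (1.1, 0.1), (3.0, 0.0), (4.0, 1.0)]
cells = overlapping_cells(points, c=4, delta=0.5)

# Each cell is searched independently; threads only mirror the structure of a
# shared-memory design and give no CPU speed-up in CPython.
with ThreadPoolExecutor() as pool:
    per_cell = list(pool.map(closest_pair, cells))

best_in_cells = min(per_cell, key=lambda r: r[0])
best_global = closest_pair(points)
```

In this toy data the closest pair, (0.9, 0.0) and (1.1, 0.1), lies on opposite sides of a cell boundary, yet both points fall in the first cell because of the overlap, so the per-cell search returns the same pair as the global search. Once the closest remaining distance exceeds `delta` this guarantee no longer holds, which is consistent with the abstract's point that only the late, long-distance merges need a global view.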