Fast hierarchical clustering and its validation

Authors:
Manoranjan Dash;Huan Liu;Peter Scheuermann;Kian Lee Tan
Affiliations:
Department of Electrical and Computer Engineering, Northwestern University, 2145, Sheridan Road, Evanston, IL;Department of Computer Science and Engineering, Arizona State University, Tempe, AZ;Department of Electrical and Computer Engineering, Northwestern University, 2145, Sheridan Road, Evanston, IL;School of Computing, National University of Singapore, 117543 Singapore
Venue:
Data & Knowledge Engineering
Year:
2003

Abstract

Clustering is the task of grouping similar objects into clusters. A prominent and useful class of algorithms is hierarchical agglomerative clustering (HAC), which iteratively agglomerates the closest pair of clusters until all data points belong to one cluster. It outputs a dendrogram showing all N levels of agglomeration, where N is the number of objects in the dataset. However, HAC methods have several drawbacks: (1) high time and memory complexities for clustering, and (2) inefficient and inaccurate cluster validation. In this paper we show that these drawbacks can be alleviated by closely studying the dendrogram. An empirical study shows that most HAC algorithms follow a trend: except for a number of top levels of the dendrogram, all lower levels agglomerate clusters that are very small and close to other clusters. Methods are proposed that exploit this characteristic to reduce the time and memory complexities significantly and to make validation very efficient and accurate. Analyses and experiments show the effectiveness of the proposed methods.
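As a rough illustration of the agglomeration loop described in the abstract, the following is a minimal Python sketch of naive HAC. It is not the method proposed in the paper: the function name `naive_hac`, the use of single linkage, and the Euclidean distance are all assumptions made for the example. The brute-force pair search is deliberately unoptimized, which hints at why plain HAC is costly on large datasets.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def naive_hac(points):
    """Naive single-linkage HAC (illustrative assumption, not the paper's method).

    Returns the merge history: one (id_a, id_b, distance, new_size) tuple
    per agglomeration, i.e. one entry per dendrogram merge.
    """
    # Start with every point in its own cluster, keyed by an integer id.
    clusters = {i: [p] for i, p in enumerate(points)}
    merges = []
    next_id = len(points)

    def linkage(a, b):
        # Single linkage: distance between the closest members of two clusters.
        return min(dist(p, q) for p in clusters[a] for q in clusters[b])

    while len(clusters) > 1:
        # Scan all pairs of current clusters and pick the closest pair.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        d = linkage(a, b)
        # Agglomerate the pair into a new cluster with a fresh id.
        merged = clusters.pop(a) + clusters.pop(b)
        clusters[next_id] = merged
        merges.append((a, b, d, len(merged)))
        next_id += 1
    return merges


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
    for a, b, d, size in naive_hac(sample):
        print(f"merge {a} + {b} at distance {d:.2f} -> cluster of size {size}")
```

The `new_size` and `distance` fields of the merge history make it possible to inspect, on a given dataset, how small and how close the clusters merged at the lower levels are compared with those merged near the top of the dendrogram.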