Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive Hashing

Authors:
Hisashi Koga;Tetsuo Ishibashi;Toshinori Watanabe
Affiliations:
University of Electro-Communications, Graduate School of Information Systems, 1-5-1 Chofugaoka, Chofu-si, 182-8585, Tokyo, Japan;University of Electro-Communications, Graduate School of Information Systems, 1-5-1 Chofugaoka, Chofu-si, 182-8585, Tokyo, Japan;University of Electro-Communications, Graduate School of Information Systems, 1-5-1 Chofugaoka, Chofu-si, 182-8585, Tokyo, Japan
Venue:
Knowledge and Information Systems
Year:
2007

Abstract

The single linkage method is a fundamental agglomerative hierarchical clustering algorithm. It initially treats each point as its own cluster. In each agglomeration step, it merges the pair of clusters whose nearest members are closest to each other, and it repeats this step until only one cluster remains. The single linkage method can detect clusters of arbitrary shape. Its drawback is a large time complexity of O(n^2), where n is the number of data points, which makes it infeasible for large data sets. This paper proposes a fast approximation algorithm for the single linkage method. Our algorithm reduces the time complexity to O(nB) by using Locality-Sensitive Hashing, a fast algorithm for approximate nearest neighbor search, to rapidly find the nearby clusters to be merged. Here, B is the maximum number of points that fall into a single hash entry; in practice it shrinks to a small constant relative to n when the hash tables are sufficiently large. Experimentally, we show that (1) the proposed algorithm produces clustering results similar to those of the single linkage method and (2) it runs faster than the single linkage method on large data.
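
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the random-projection hash with cell width r, the radius schedule (r0, growth), the parameters num_tables and num_projections, and all function and class names. Points that share a hash bucket are treated as merge candidates, and their clusters are joined with a union-find structure when the points lie within the current radius r; r then grows until one cluster remains. For simplicity the sketch compares all pairs inside a bucket, so it does not reproduce the O(nB) bound stated in the abstract.

```python
import math
import random
from collections import defaultdict


class UnionFind:
    """Tracks which cluster each point currently belongs to."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.count = n  # number of clusters still separate

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        self.count -= 1
        return True


def lsh_buckets(points, r, num_projections, rng):
    """Hash every point with one freshly drawn random-projection function.

    Assumption: a random-projection hash with cell width r stands in for
    whatever hash family the paper actually uses.
    """
    dim = len(points[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(num_projections)]
    offsets = [rng.uniform(0.0, r) for _ in range(num_projections)]
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(
            math.floor((sum(a * x for a, x in zip(plane, p)) + b) / r)
            for plane, b in zip(planes, offsets)
        )
        buckets[key].append(idx)
    return buckets


def lsh_single_linkage(points, r0, growth=2.0, num_tables=4,
                       num_projections=2, seed=0):
    """Return the merges as (point_a, point_b, radius) in the order made."""
    rng = random.Random(seed)
    uf = UnionFind(len(points))
    merges = []
    r = r0
    while uf.count > 1:
        for _ in range(num_tables):
            buckets = lsh_buckets(points, r, num_projections, rng)
            for members in buckets.values():
                # Only points sharing a bucket are compared.
                for i in range(len(members)):
                    for j in range(i + 1, len(members)):
                        a, b = members[i], members[j]
                        if (uf.find(a) != uf.find(b)
                                and math.dist(points[a], points[b]) <= r):
                            uf.union(a, b)
                            merges.append((a, b, r))
        r *= growth  # widen the search radius for the next phase
    return merges


if __name__ == "__main__":
    demo = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (50.0, 50.0), (51.0, 50.0)]
    for a, b, r in lsh_single_linkage(demo, r0=1.0):
        print(f"merge points {a} and {b} at radius {r}")
```

The radius at which two clusters merge plays the role of the dendrogram height, so the output is a coarse approximation of the single linkage hierarchy: merges are ordered by phase rather than by exact distance.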