Efficient hierarchical clustering of large high dimensional datasets

Authors:
Sean Gilpin;Buyue Qian;Ian Davidson
Affiliations:
University of California, Davis, Davis, CA, USA;IBM, Yorktown Heights, NY, USA;University of California, Davis, Davis, CA, USA
Venue:
Proceedings of the 22nd ACM international conference on information & knowledge management
Year:
2013

Abstract

Hierarchical clustering is widely used to organize high-dimensional objects such as documents and images into a structure that can then be used in many ways. However, existing algorithms are limited in their application because the time complexity of agglomerative-style algorithms can be as much as O(n^2 log n), where n is the number of objects. Furthermore, computing the similarity between such objects is itself time-consuming because they are high-dimensional: even optimized built-in functions found in MATLAB take the best part of a day to handle collections of just 10,000 objects on typical machines. In this paper we explore using angular hashing to place objects that are separated by a small angular distance in the same hash bucket. This allows us to create hierarchies of objects within each hash bucket and to hierarchically cluster the hash buckets themselves. With our formal guarantees on the similarity of objects in the same bucket, this leads to an elegant agglomerative algorithm with strong performance bounds. Our experimental results show that our approach is not only thousands of times faster than regular agglomerative algorithms but, surprisingly, typically as accurate and sometimes substantially more accurate.
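
The bucketing step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the hash family, so this example assumes random-hyperplane hashing (sign random projections), a standard angular hash in which two vectors at angle theta agree on any single bit with probability 1 - theta/pi. The function names `make_hyperplanes`, `angular_hash`, and `bucketize` are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian normal vectors in dim dimensions (assumed hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def angular_hash(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side, else 0.

    The key depends only on the direction of vec, not on its length.
    """
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def bucketize(vectors, hyperplanes):
    """Group vector indices by hash key; vectors at a small angle tend to share a key."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[angular_hash(vec, hyperplanes)].append(idx)
    return dict(buckets)


if __name__ == "__main__":
    planes = make_hyperplanes(num_bits=8, dim=3, seed=1)
    docs = [[1.0, 2.0, -0.5], [2.0, 4.0, -1.0], [-1.0, -2.0, 0.5]]
    for key, members in bucketize(docs, planes).items():
        print(key, members)
```

Under these assumptions, each bucket could then be clustered independently with any agglomerative routine, and the buckets themselves clustered afterwards, which is the two-level structure the abstract describes. Using more bits makes buckets smaller and more angularly coherent, at the cost of separating some similar objects.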