Hierarchical clustering is extensively used to organize high-dimensional objects such as documents and images into a structure that can then be used in a multitude of ways. However, existing algorithms are limited in their application, since the time complexity of agglomerative-style algorithms can be as much as O(n^2 log n), where n is the number of objects. Furthermore, computing the similarity between such objects is itself time consuming because they are high dimensional, and even the optimized built-in functions found in MATLAB take the better part of a day to handle collections of just 10,000 objects on typical machines. In this paper we explore using angular hashing to hash objects that are close in angular distance to the same hash bucket. This allows us to create hierarchies of objects within each hash bucket and to hierarchically cluster the hash buckets themselves. With our formal guarantees on the similarity of objects in the same bucket, this leads to an elegant agglomerative algorithm with strong performance bounds. Our experimental results show that not only is our approach thousands of times faster than regular agglomerative algorithms but, surprisingly, the accuracy of our results is typically as good and can sometimes be substantially better.
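The bucketing step can be illustrated with a small sketch. The abstract does not specify the hash family, so the code below assumes a common form of angular hashing, sign random projection, in which each hash bit records on which side of a random hyperplane a vector falls. The function names (make_hyperplanes, angular_hash, bucket_objects) and the parameters n_bits and seed are hypothetical and chosen only for illustration; this is not the paper's implementation.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def angular_hash(vec, hyperplanes):
    """Return one bit per hyperplane: 1 if vec lies on its non-negative side.

    Assumption: sign random projection stands in for the paper's hash.
    Vectors separated by a small angle agree on most bits.
    """
    return tuple(
        int(sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0.0)
        for h in hyperplanes
    )


def bucket_objects(vectors, hyperplanes):
    """Group object indices by hash key, so similar directions share a bucket."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[angular_hash(vec, hyperplanes)].append(idx)
    return dict(buckets)
```

Under this assumption, a standard agglomerative algorithm would then be run inside each bucket, and the buckets themselves would be merged hierarchically. Because only the sign of each projection matters, rescaling a vector by a positive constant never changes its bucket, which matches the idea of hashing by angle rather than by magnitude.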