Speeding-Up hierarchical agglomerative clustering in presence of expensive metrics

Authors: Mirco Nanni
Affiliations: ISTI-CNR, Pisa, Italy
Venue: PAKDD'05 Proceedings of the 9th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Year: 2005

Abstract

In several contexts and domains, hierarchical agglomerative clustering (HAC) offers the best-quality results, but at the price of a high complexity that limits the size of the datasets it can handle. In some contexts, in particular, computing distances between objects is the most expensive task. In this paper we propose a pruning heuristic aimed at improving performance in these cases; it is well integrated in all phases of the HAC process and can be applied to two HAC variants: single-linkage and complete-linkage. After describing the method, we provide some theoretical evidence of its pruning power, followed by an empirical study of its effectiveness over different data domains, with a special focus on dimensionality issues.
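
To make the idea of distance pruning concrete, the following is a minimal sketch of one generic way to avoid expensive distance computations when searching for the closest pair, which is the basic step of single-linkage HAC. It is not taken from the paper: it assumes the distance is a true metric and uses pivot-based lower bounds derived from the triangle inequality, and the names `CountingMetric`, `closest_pair_with_pruning`, and `num_pivots` are hypothetical.

```python
import itertools
import math


class CountingMetric:
    """Wraps a distance function and counts how many times it is called."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return self.fn(a, b)


def closest_pair_with_pruning(objects, dist, num_pivots=2):
    """Return ((i, j), d) for the closest pair, skipping prunable pairs.

    Assumption (not from the source): `dist` satisfies the triangle
    inequality, so |d(i, p) - d(j, p)| <= d(i, j) for any pivot p.
    """
    n = len(objects)
    pivots = range(min(num_pivots, n))
    # Exact distances from every object to each pivot (n * num_pivots calls).
    to_pivot = [[dist(objects[i], objects[p]) for p in pivots] for i in range(n)]

    best, best_pair = math.inf, None
    for i, j in itertools.combinations(range(n), 2):
        # Cheap lower bound on d(i, j); no call to the expensive metric.
        lower = max(abs(a - b) for a, b in zip(to_pivot[i], to_pivot[j]))
        if lower >= best:
            continue  # this pair cannot beat the current best: pruned
        d = dist(objects[i], objects[j])
        if d < best:
            best, best_pair = d, (i, j)
    return best_pair, best


# Usage with plain Euclidean distance standing in for an expensive metric.
points = [(0.0, 0.0), (5.0, 1.0), (9.0, 9.0), (5.2, 1.1), (-3.0, 7.0), (12.0, -4.0)]
metric = CountingMetric(math.dist)
pair, d = closest_pair_with_pruning(points, metric)
print(pair, d, metric.calls)
```

The pivot distances are an up-front cost, so a scheme like this only pays off when the number of objects is large relative to the number of pivots and the lower bounds are tight enough to discard many pairs. This simple version also recomputes distances for pairs that involve a pivot instead of reusing the stored ones.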