Fast Hierarchical Clustering Based on Compressed Data and OPTICS

Authors:
Markus M. Breunig;Hans-Peter Kriegel;Jörg Sander
Venue:
PKDD '00 Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery
Year:
2000

Citing 5
Cited 2

BIRCH: an efficient data clustering method for very large databases
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data

OPTICS: ordering points to identify the clustering structure
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data

Squashing flat files flatter
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining

The X-tree: An Index Structure for High-Dimensional Data
VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases

Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification
SSD '95 Proceedings of the 4th International Symposium on Advances in Spatial Databases

Cited 2

Fast Single-Link Clustering Method Based on Tolerance Rough Set Model
RSFDGrC '09 Proceedings of the 12th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing

Tolerance rough set theory based data summarization for clustering large datasets
Transactions on rough sets XIV

Abstract

One way to scale up clustering algorithms is to squash the data by some intelligent compression technique and cluster only the compressed data records. Such compressed data records can, for example, be produced by the BIRCH algorithm. Typically they consist of sufficient statistics of the form (N, X, X²), where N is the number of points, X is the (vector) sum, and X² is the square sum of the points. They can be used directly to speed up k-means-type clustering algorithms, but it is not obvious how to use them in a hierarchical clustering algorithm. Applying a hierarchical clustering algorithm to, for example, the centers of compressed subclusters produces a very weak result. The reason is that hierarchical clustering algorithms are based on the distances between data points, and the interpretation of the result relies heavily on a correct graphical representation of these distances. In this paper, we introduce a method by which the sufficient statistics (N, X, X²) of subclusters can be utilized in the hierarchical clustering method OPTICS. We show how to generate appropriate distance information about compressed data points, and how to adapt the graphical representation of the clustering result. A performance evaluation using OPTICS in combination with BIRCH demonstrates that our approach is extremely efficient (speed-up factors of up to 1700) and produces high-quality results.
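As a rough illustration of the sufficient statistics described in the abstract, the following Python sketch computes the point count, the vector sum, and the square sum for a subcluster, and derives the center and an average radius from them. It is not the paper's OPTICS adaptation. The class name SufficientStats, its method names, and the choice of a scalar square sum (the sum of squared Euclidean norms, as in BIRCH clustering features) are assumptions made for this example only.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SufficientStats:
    """Hypothetical container for the triple (N, vector sum, square sum)."""
    n: int                    # N: number of points in the subcluster
    linear_sum: List[float]   # X: per-dimension sum of the points
    square_sum: float         # square sum, assumed scalar: sum of squared norms

    @classmethod
    def from_points(cls, points: Sequence[Sequence[float]]) -> "SufficientStats":
        dim = len(points[0])
        linear = [sum(p[i] for p in points) for i in range(dim)]
        square = sum(x * x for p in points for x in p)
        return cls(len(points), linear, square)

    def merge(self, other: "SufficientStats") -> "SufficientStats":
        # The statistics are additive, so two subclusters can be combined
        # without revisiting the original points.
        return SufficientStats(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.square_sum + other.square_sum,
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def radius(self) -> float:
        # Root mean squared distance of the points to the centroid:
        # sqrt(square_sum / N - ||centroid||^2).
        c = self.centroid()
        variance = self.square_sum / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))  # clamp rounding noise


stats = SufficientStats.from_points([(0.0, 0.0), (2.0, 0.0)])
print(stats.centroid(), stats.radius())
```

Because the three values are additive, merging two summaries gives the same result as summarizing the union of their points, which is what lets a compression step discard the raw data.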