Data bubbles for non-vector data: speeding-up hierarchical clustering in arbitrary metric spaces

Authors:
Jianjun Zhou; Jörg Sander
Affiliations:
University of Alberta, Department of Computing Science, Edmonton, Alberta, Canada (both authors)
Venue:
VLDB '03: Proceedings of the 29th International Conference on Very Large Data Bases, Volume 29
Year:
2003

Abstract

To speed up clustering algorithms, data summarization methods have been proposed that first summarize the data set by computing suitable representative objects. A clustering algorithm is then applied to these representatives only, and a clustering structure for the whole data set is derived from the result for the representatives. Most previous methods, however, are limited in their application domain: they are generally based on sufficient statistics, such as the linear sum of a set of points, which assume that the data come from a vector space. In many important applications, by contrast, the data come from a metric but non-vector space, and only the distances between objects can be exploited to construct effective data summarizations. In this paper, we develop a new data summarization method that is based only on distance information and can be applied directly to non-vector data. An extensive performance evaluation shows that our method is very effective at finding the hierarchical clustering structure of non-vector data using only a very small number of data summarizations, resulting in a large reduction in runtime at the cost of very little clustering quality.
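The general summarize-then-cluster pipeline described in the abstract can be illustrated with a minimal Python sketch that uses only a distance function. This is not the data summarization method developed in the paper. It assumes randomly sampled representatives, nearest-representative assignment, and a flat single-link clustering at a distance threshold as simple stand-ins, with edit distance on strings as an example of a non-vector metric. All function names and parameters (`summarize`, `single_link`, `cluster_via_representatives`, `k`, `threshold`) are hypothetical.

```python
import random


def edit_distance(a, b):
    """Levenshtein distance: a metric on strings that needs no vector operations."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def summarize(objects, dist, k, seed=0):
    """Assumed summarization: sample k representatives, then assign every
    object to its nearest representative using distances only."""
    rng = random.Random(seed)
    reps = rng.sample(range(len(objects)), k)  # indices into objects
    assignment = [
        min(range(k), key=lambda r: dist(obj, objects[reps[r]]))
        for obj in objects
    ]
    return reps, assignment


def single_link(items, dist, threshold):
    """Flat single-link clustering: connected components of the graph that
    links two items whenever their distance is at most threshold."""
    parent = list(range(len(items)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if dist(items[i], items[j]) <= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(items))]


def cluster_via_representatives(objects, dist, k, threshold, seed=0):
    """Cluster only the k representatives, then give every object the
    cluster label of the representative it was assigned to."""
    reps, assignment = summarize(objects, dist, k, seed)
    rep_labels = single_link([objects[r] for r in reps], dist, threshold)
    return [rep_labels[a] for a in assignment]


if __name__ == "__main__":
    words = ["cat", "cot", "cut", "dog", "dig", "elephant"]
    labels = cluster_via_representatives(words, edit_distance, k=3, threshold=1)
    for word, label in zip(words, labels):
        print(word, label)
```

In this sketch the clustering step computes on the order of k² distances between representatives instead of n² between all objects, which is where the runtime saving comes from. The quality of the result depends on how well the representatives cover the data, which is the problem a more careful summarization has to address.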