Incremental and effective data summarization for dynamic hierarchical clustering

Authors:
Samer Nassar;Jörg Sander;Corrine Cheng
Affiliations:
University of Alberta, Edmonton, Alberta, Canada;University of Alberta, Edmonton, Alberta, Canada;University of Alberta, Edmonton, Alberta, Canada
Venue:
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Year:
2004

Abstract

Mining informative patterns from very large, dynamically changing databases poses numerous interesting challenges. Data summarizations (e.g., data bubbles) have been proposed to compress very large static databases into representative points suitable for subsequent effective hierarchical cluster analysis. In many real-world applications, however, databases change dynamically due to frequent insertions and deletions, possibly changing the data distribution and clustering structure over time. Completely reapplying both the data summarization and the clustering algorithm after such insertions and deletions, in order to detect changes in the clustering structure and update the uncovered patterns, is prohibitively expensive for large, fast-changing databases. In this paper, we propose a new scheme to maintain data bubbles incrementally. By using incremental data bubbles, a high-quality hierarchical clustering is quickly available at any point in time. In our scheme, a quality measure for incremental data bubbles is used to identify data bubbles that no longer compress their underlying data points well after certain insertions and deletions. Only these data bubbles are rebuilt, using efficient split and merge operations. An extensive experimental evaluation shows that incremental data bubbles provide significantly faster data summarization than completely rebuilding the data bubbles after a certain number of insertions and deletions, and that they are effective in preserving (and in some cases even improving) the quality of the data summarization.
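
The abstract does not spell out how a bubble is updated or how its quality is scored, so the following is only a minimal sketch under stated assumptions. It assumes each bubble keeps additive statistics (point count, per-dimension linear sum, sum of squared norms), which makes a single insertion or deletion cheap, and it uses a simple "share of points deviates from the mean by more than k standard deviations" test as a stand-in for the quality measure. The names `Bubble`, `flag_poor_bubbles`, and the threshold `k` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import statistics
from dataclasses import dataclass, field


@dataclass
class Bubble:
    """Additive summary of a set of points (illustrative, not the paper's definition)."""
    dim: int
    n: int = 0                      # number of summarized points
    linear_sum: list = field(default_factory=list)  # per-dimension sum of coordinates
    square_sum: float = 0.0         # sum of squared Euclidean norms

    def __post_init__(self):
        if not self.linear_sum:
            self.linear_sum = [0.0] * self.dim

    def insert(self, p):
        """Absorb point p; cost is linear in the dimension."""
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, p)]
        self.square_sum += sum(x * x for x in p)

    def delete(self, p):
        """Remove point p; the caller must ensure p was inserted earlier."""
        self.n -= 1
        self.linear_sum = [s - x for s, x in zip(self.linear_sum, p)]
        self.square_sum -= sum(x * x for x in p)

    def merge(self, other):
        """Fold another bubble into this one by adding its statistics."""
        self.n += other.n
        self.linear_sum = [a + b for a, b in zip(self.linear_sum, other.linear_sum)]
        self.square_sum += other.square_sum

    def representative(self):
        """Mean of the summarized points (requires n > 0)."""
        return [s / self.n for s in self.linear_sum]


def flag_poor_bubbles(bubbles, k=2.0):
    """Return indices of bubbles whose share of all points is unusually
    large or small. Assumed stand-in for a quality measure: a share more
    than k population standard deviations from the mean share is flagged."""
    total = sum(b.n for b in bubbles)
    if total == 0 or len(bubbles) < 2:
        return []
    shares = [b.n / total for b in bubbles]
    mean = statistics.fmean(shares)
    std = statistics.pstdev(shares)
    return [i for i, s in enumerate(shares) if abs(s - mean) > k * std]
```

In such a design, only the flagged bubbles would be candidates for rebuilding: an over-filled bubble could be split by redistributing its points, and an under-filled one could be folded into a neighbour with `merge`. How the split chooses new representatives is left open here because the abstract does not describe it.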