A robust and efficient clustering algorithm based on cohesion self-merging

Authors:
Cheng-Ru Lin;Ming-Syan Chen
Affiliations:
National Taiwan University, Taipei, Taiwan, R.O.C.;National Taiwan University, Taipei, Taiwan, R.O.C.
Venue:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2002

Citing 10
Cited 9

How many clusters are best?—an experiment

Pattern Recognition
Introduction to the theory of neural computation

Introduction to the theory of neural computation
BIRCH: an efficient data clustering method for very large databases

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
CURE: an efficient clustering algorithm for large databases

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Data clustering: a review

ACM Computing Surveys (CSUR)
Density biased sampling: an improved method for data mining and clustering

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Data Mining: An Overview from a Database Perspective

IEEE Transactions on Knowledge and Data Engineering
Efficient and Effective Clustering Methods for Spatial Data Mining

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
An Efficient Clustering Algorithm for Market Basket Data Based on Small Large Ratios

COMPSAC '01 Proceedings of the 25th International Computer Software and Applications Conference on Invigorating Software Development
ROCK: A Robust Clustering Algorithm for Categorical Attributes

ICDE '99 Proceedings of the 15th International Conference on Data Engineering

Combining Partitional and Hierarchical Algorithms for Robust and Efficient Data Clustering with Cohesion Self-Merging

IEEE Transactions on Knowledge and Data Engineering
Dual Clustering: Integrating Data Clustering over Optimization and Constraint Domains

IEEE Transactions on Knowledge and Data Engineering
Adherence clustering: an efficient method for mining market-basket clusters

Information Systems
An incremental network for on-line unsupervised classification and topology learning

Neural Networks
Constrained data clustering by depth control and progressive constraint relaxation

The VLDB Journal — The International Journal on Very Large Data Bases
Separation index and partial membership for clustering

Computational Statistics & Data Analysis
Adherence clustering: an efficient method for mining market-basket clusters

Information Systems
Hierarchical K-means clustering algorithm based on silhouette and entropy

AICI'11 Proceedings of the Third international conference on Artificial intelligence and computational intelligence - Volume Part I
MOSAIC: a proximity graph approach for agglomerative clustering

DaWaK'07 Proceedings of the 9th international conference on Data Warehousing and Knowledge Discovery

Quantified Score

Hi-index	0.00

Visualization

Abstract

Data clustering has attracted a lot of research attention in the field of computational statistics and data mining. In most related studies, the dissimilarity between two clusters is defined as the distance between their centroids, or the distance between two closest (or farthest) data points. However, all of these measurements are vulnerable to outliers, and removing the outliers precisely is yet another difficult task. In view of this, we propose a new similarity measurement referred to as cohesion, to measure the inter-cluster distances. By using this new measurement of cohesion, we design a two-phase clustering algorithm, called cohesion-based self-merging (abbreviated as CSM), which runs in linear time to the size of input data set. Combining the features of partitional and hierarchical clustering methods, algorithm CSM partitions the input data set into several small subclusters in the first phase, and then continuously merges the subclusters based on cohesion in a hierarchical manner in the second phase. As shown by our performance studies, the cohesion-based clustering is very robust and possesses the excellent tolerance to outliers in various workloads. More importantly, algorithm CSM is shown to be able to cluster the data sets of arbitrary shapes very efficiently, and provide better clustering results than those by prior methods.Index Terms: Data mining, data clustering, hierarchical clustering, partitional clustering