Data clustering has attracted considerable research attention in computational statistics and data mining. In most related studies, the dissimilarity between two clusters is defined as the distance between their centroids, or as the distance between their two closest (or farthest) data points. However, all of these measurements are vulnerable to outliers, and removing outliers precisely is itself a difficult task. In view of this, we propose a new similarity measurement, referred to as cohesion, to measure inter-cluster distances. Using cohesion, we design a two-phase clustering algorithm, called cohesion-based self-merging (CSM), whose running time is linear in the size of the input data set. Combining the features of partitional and hierarchical clustering methods, algorithm CSM partitions the input data set into several small subclusters in the first phase, and then repeatedly merges the subclusters based on cohesion in a hierarchical manner in the second phase. As shown by our performance studies, cohesion-based clustering is very robust and tolerates outliers well under various workloads. More importantly, algorithm CSM is shown to cluster data sets of arbitrary shapes very efficiently and to provide better clustering results than prior methods.

Index Terms: Data mining, data clustering, hierarchical clustering, partitional clustering
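
The three conventional inter-cluster distances named in the abstract (centroid distance, closest-pair distance, and farthest-pair distance) can be illustrated with a minimal Python sketch. This is not the cohesion measure or algorithm CSM, neither of which is defined in the abstract; the function names and the toy data are assumptions chosen only to show how a single stray point can shift such measurements.

```python
import math
from itertools import product


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def centroid_distance(a, b):
    """Distance between the centroids of clusters a and b."""
    return math.dist(centroid(a), centroid(b))


def closest_pair_distance(a, b):
    """Distance between the two closest points, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(a, b))


def farthest_pair_distance(a, b):
    """Distance between the two farthest points, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(a, b))


# Hypothetical toy data: two well-separated unit squares.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
cluster_b = [(10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0)]

# The same cluster with one stray point lying between the two squares.
cluster_a_noisy = cluster_a + [(6.0, 0.5)]

for name, measure in [
    ("centroid", centroid_distance),
    ("closest pair", closest_pair_distance),
    ("farthest pair", farthest_pair_distance),
]:
    clean = measure(cluster_a, cluster_b)
    noisy = measure(cluster_a_noisy, cluster_b)
    print(f"{name}: clean={clean:.3f} with outlier={noisy:.3f}")
```

In this toy configuration the stray point more than halves the closest-pair distance and also pulls the centroid toward the other cluster, while the farthest-pair distance happens to be unaffected; an outlier placed on the far side of a cluster would affect that measure instead.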