Combining Partitional and Hierarchical Algorithms for Robust and Efficient Data Clustering with Cohesion Self-Merging

Authors:
Cheng-Ru Lin;Ming-Syan Chen
Affiliations:
-;IEEE
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2005

Abstract

Data clustering has attracted considerable research attention in computational statistics and data mining. In most related studies, the dissimilarity between two clusters is defined as the distance between their centroids or the distance between their two closest (or farthest) data points. However, all of these measures are vulnerable to outliers, and removing the outliers precisely is itself a difficult task. In view of this, we propose a new similarity measure, referred to as cohesion, for measuring intercluster distances. Using this measure, we have designed a two-phase clustering algorithm, called cohesion-based self-merging (abbreviated as CSM), which runs in time linear in the size of the input data set. Combining the features of partitional and hierarchical clustering methods, algorithm CSM partitions the input data set into several small subclusters in the first phase and then repeatedly merges the subclusters based on cohesion in a hierarchical manner in the second phase. The time and space complexities of algorithm CSM are analyzed. As shown by our performance studies, cohesion-based clustering is very robust and possesses excellent tolerance to outliers in various workloads. More importantly, algorithm CSM is shown to cluster data sets of arbitrary shapes very efficiently and to provide better clustering results than prior methods.
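
To illustrate the outlier sensitivity that the abstract attributes to the conventional measures, the following minimal Python sketch computes the three classical intercluster distances (centroid distance, closest-pair distance, and farthest-pair distance) on two toy clusters, with and without a single stray point. The function names, the toy coordinates, and the two outlier positions are assumptions chosen for illustration only. The sketch does not implement cohesion or algorithm CSM, whose definitions are not given in this abstract.

```python
from itertools import product
from math import dist  # Euclidean distance, available in Python 3.8+


def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def centroid_distance(a, b):
    """Distance between the centroids of clusters a and b."""
    return dist(centroid(a), centroid(b))


def closest_pair_distance(a, b):
    """Distance between the two closest points, one taken from each cluster."""
    return min(dist(p, q) for p, q in product(a, b))


def farthest_pair_distance(a, b):
    """Distance between the two farthest points, one taken from each cluster."""
    return max(dist(p, q) for p, q in product(a, b))


# Two compact, well-separated toy clusters (hypothetical data).
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
cluster_b = [(10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0)]

# One stray point assigned to cluster_a, in two hypothetical positions:
# between the clusters, and far away on the opposite side.
a_with_bridge = cluster_a + [(5.0, 0.5)]
a_with_far = cluster_a + [(-20.0, 0.5)]

measures = [
    ("centroid", centroid_distance),
    ("closest pair", closest_pair_distance),
    ("farthest pair", farthest_pair_distance),
]

if __name__ == "__main__":
    for name, measure in measures:
        clean = measure(cluster_a, cluster_b)
        bridge = measure(a_with_bridge, cluster_b)
        far = measure(a_with_far, cluster_b)
        print(f"{name:13s} clean={clean:.3f} bridge={bridge:.3f} far={far:.3f}")
```

In this toy setup, the point placed between the clusters sharply reduces the closest-pair distance, the distant point inflates the farthest-pair distance, and both shift the centroid distance. Each change comes from a single point, which is the kind of fragility that motivates a more robust intercluster measure.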