K-means clustering versus validation measures: a data distribution perspective
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
K-means is a well-known and widely used partitional clustering method. While considerable research effort has gone into characterizing the key features of the K-means clustering algorithm, further investigation is needed to understand how data distributions impact the performance of K-means clustering. To that end, in this paper we provide a formal and organized study of the effect of skewed data distributions on K-means clustering. Along this line, we first formally illustrate that K-means tends to produce clusters of relatively uniform size, even if the input data have varied "true" cluster sizes. In addition, we show that some clustering validation measures, such as the entropy measure, may not capture this uniform effect and can provide misleading information on the clustering performance. Viewed in this light, we propose the coefficient of variation (CV) as a necessary criterion for validating clustering results. Our findings reveal that K-means tends to produce clusters in which the variation of cluster sizes, as measured by CV, falls in a range of about 0.3 to 1.0. Specifically, for data sets with large variation in "true" cluster sizes (e.g., CV > 1.0), K-means reduces the variation in the resultant cluster sizes to less than 1.0. In contrast, for data sets with small variation in "true" cluster sizes (e.g., CV < 0.3), K-means increases the variation in the resultant cluster sizes to greater than 0.3.
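The CV criterion described above can be sketched in a few lines of standard-library Python. The following is an illustrative toy, not the paper's implementation: the minimal 1-D Lloyd's K-means, the sample data, and the 95/5 "true" cluster sizes are all hypothetical choices made for demonstration. CV is simply the standard deviation of the cluster sizes divided by their mean.

```python
import random
import statistics

def cv(sizes):
    """Coefficient of variation of cluster sizes: pstdev / mean."""
    return statistics.pstdev(sizes) / statistics.mean(sizes)

def kmeans_1d(points, k, iters=50, seed=0):
    """Minimal Lloyd's K-means on 1-D data (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        # Update step: recompute each center as its cluster mean.
        centers = [statistics.mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

# Hypothetical skewed data: "true" cluster sizes 95 and 5,
# so the true CV is pstdev([95, 5]) / mean([95, 5]) = 45 / 50 = 0.9.
rng = random.Random(1)
data = ([rng.gauss(0.0, 1.0) for _ in range(95)] +
        [rng.gauss(10.0, 1.0) for _ in range(5)])

true_cv = cv([95, 5])
result_cv = cv([len(c) for c in kmeans_1d(data, 2)])
print(f"true CV = {true_cv:.2f}, K-means CV = {result_cv:.2f}")
```

Comparing `true_cv` against `result_cv` is the kind of check the abstract argues for: an entropy-style measure can look good even when K-means has flattened a skewed size distribution, whereas the CV of the resultant cluster sizes makes the uniformizing effect visible.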