K-means clustering versus validation measures: a data distribution perspective
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
K-means is a well-known and widely used partitional clustering method. While considerable research effort has gone into characterizing the key features of the K-means clustering algorithm, further investigation is needed to understand how data distributions impact the performance of K-means clustering. To that end, in this paper we provide a formal and organized study of the effect of skewed data distributions on K-means clustering. Along this line, we first formally illustrate that K-means tends to produce clusters of relatively uniform size, even if the input data have varied "true" cluster sizes. In addition, we show that some clustering validation measures, such as the entropy measure, may not capture this uniform effect and can provide misleading information on the clustering performance. Viewed in this light, we propose the coefficient of variation (CV) as a necessary criterion for validating clustering results. Our findings reveal that K-means tends to produce clusters in which the variation of cluster sizes, as measured by CV, falls in a range of about 0.3 to 1.0. Specifically, for data sets with large variation in "true" cluster sizes (e.g., CV > 1.0), K-means reduces the variation in the resultant cluster sizes to less than 1.0. In contrast, for data sets with small variation in "true" cluster sizes (e.g., CV < 0.3), K-means increases the variation in the resultant cluster sizes to greater than 0.3.
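The CV criterion described above can be sketched in a few lines of standard-library Python. The following is an illustrative toy, not the paper's implementation: the minimal 1-D Lloyd's K-means, the sample data, and the 95/5 "true" cluster sizes are all hypothetical choices made for demonstration. CV is simply the standard deviation of the cluster sizes divided by their mean.

```python
import random
import statistics

def cv(sizes):
    """Coefficient of variation of cluster sizes: pstdev / mean."""
    return statistics.pstdev(sizes) / statistics.mean(sizes)

def kmeans_1d(points, k, iters=50, seed=0):
    """Minimal Lloyd's K-means on 1-D data (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        # Update step: recompute each center as its cluster mean.
        centers = [statistics.mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

# Hypothetical skewed data: "true" cluster sizes 95 and 5,
# so the true CV is pstdev([95, 5]) / mean([95, 5]) = 45 / 50 = 0.9.
rng = random.Random(1)
data = ([rng.gauss(0.0, 1.0) for _ in range(95)] +
        [rng.gauss(10.0, 1.0) for _ in range(5)])

true_cv = cv([95, 5])
result_cv = cv([len(c) for c in kmeans_1d(data, 2)])
print(f"true CV = {true_cv:.2f}, K-means CV = {result_cv:.2f}")
```

Comparing `true_cv` against `result_cv` is the kind of check the abstract argues for: an entropy-style measure can look good even when K-means has flattened a skewed size distribution, whereas the CV of the resultant cluster sizes makes the uniformizing effect visible.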