A study of K-Means-based algorithms for constrained clustering

Authors:
Thiago F. Covões;Eduardo R. Hruschka;Joydeep Ghosh
Affiliations:
University of São Paulo, São Carlos, Brazil and University of Texas, Austin, TX, USA;University of São Paulo, São Carlos, Brazil and University of Texas, Austin, TX, USA;University of Texas, Austin, TX, USA
Venue:
Intelligent Data Analysis
Year:
2013



Abstract

The problem of clustering with constraints has received considerable attention in the last decade. Several algorithms have been proposed, but only a few studies have compared their performance, and those only partially. In this work, three well-known algorithms for k-means-based clustering with soft constraints, namely Constrained Vector Quantization Error (CVQE), its variant LCVQE, and Metric Pairwise Constrained K-Means (MPCK-Means), are systematically compared according to three criteria: Adjusted Rand Index, Normalized Mutual Information, and the number of violated constraints. Experiments were performed on 20 datasets, and for each of them 800 sets of constraints were generated. To provide some reassurance that the results are not due to chance, outcomes of statistical significance tests are presented. In terms of accuracy, LCVQE proved competitive with CVQE while violating fewer constraints. On most datasets, both CVQE and LCVQE were more accurate than MPCK-Means, which is capable of learning distance metrics. In this regard, it was also observed that learning a separate distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters. The robustness of the algorithms with respect to noisy constraints was also analyzed. From this perspective, the most interesting conclusion is that CVQE performed better than LCVQE in most of the experiments. The computational complexities of the algorithms are also presented. Finally, a variety of more specific new experimental findings are discussed in the paper; for example, deduced constraints usually do not help in finding better data partitions.
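
As an illustration of two of the evaluation criteria named in the abstract, the following minimal Python sketch counts violated pairwise constraints and computes the Adjusted Rand Index from pair counts. It is not taken from the paper: the function names, the representation of constraints as lists of index pairs (must-link and cannot-link), and the toy labels are assumptions made only for this example.

```python
from collections import Counter
from math import comb


def count_violations(labels, must_link, cannot_link):
    """Count pairwise constraints that a partition does not satisfy.

    must_link / cannot_link are lists of (i, j) index pairs; this
    representation is an assumption made for illustration.
    """
    broken_ml = sum(1 for i, j in must_link if labels[i] != labels[j])
    broken_cl = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return broken_ml + broken_cl


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index computed from contingency pair counts."""
    n = len(truth)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = pairs_truth * pairs_pred / comb(n, 2)
    max_index = (pairs_truth + pairs_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (pairs_joint - expected) / (max_index - expected)


# Toy data (hypothetical): six objects, two reference classes.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
must_link = [(0, 1), (0, 2)]
cannot_link = [(2, 3), (0, 5)]

violations = count_violations(pred, must_link, cannot_link)  # 2
ari = adjusted_rand_index(truth, pred)  # about 0.324
print(violations, round(ari, 3))
```

In this toy case one must-link pair is split and one cannot-link pair is merged, so two constraints are violated, while the Adjusted Rand Index stays well below 1 because a single misplaced object changes several pair relationships at once.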