Two approaches to understanding when constraints help clustering

Authors:
Ian Davidson
Affiliations:
University of California - Davis, Davis, CA, USA
Venue:
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2012

Citing 10
Cited 0

The Markov chain Monte Carlo method: an approach to approximate counting and integration

Approximation algorithms for NP-hard problems
Constrained K-means Clustering with Background Knowledge

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Clustering with Instance-level Constraints

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Path coupling: A technique for proving rapid mixing in Markov chains

FOCS '97 Proceedings of the 38th Annual Symposium on Foundations of Computer Science
Intelligent clustering with instance-level constraints

Intelligent clustering with instance-level constraints
A probabilistic framework for semi-supervised clustering

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning a Mahalanobis Metric from Equivalence Constraints

The Journal of Machine Learning Research
The complexity of non-hierarchical clustering with instance and cluster level constraints

Data Mining and Knowledge Discovery
From sampling to model counting

IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence
Measuring constraint-set utility for partitional clustering algorithms

PKDD'06 Proceedings of the 10th European conference on Principle and Practice of Knowledge Discovery in Databases

Quantified Score

Hi-index	0.00

Visualization

Abstract

Most algorithm work in data mining focuses on designing algorithms to address a learning problem. Here we focus our attention on designing algorithms to determine the ease or difficulty of a problem instance. The area of clustering under constraints has recently received much attention in the data mining community. We can view the constraints as restricting (either directly or indirectly) the search space of a clustering algorithm to just feasible clusterings. However, to our knowledge no work explores methods to count the feasible clusterings or other measures of difficulty nor the importance of these measures. We present two approaches to efficiently characterize the difficulty of satisfying must-link (ML) and cannot-link (CL) constraints: calculating the fractional chromatic polynomial of the constraint graph using LP and approximately counting the number of feasible clusterings using MCMC samplers. We show that these measures are correlated to the classical performance measures of constrained clustering algorithms. From these insights and our algorithms we construct new methods of generating and pruning constraints and empirically demonstrate their usefulness.