From data mining to knowledge discovery: an overview
Advances in knowledge discovery and data mining
Split and Merge EM Algorithm for Improving Gaussian Mixture Density Estimates
Journal of VLSI Signal Processing Systems
Constrained K-means Clustering with Background Knowledge
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Semi-supervised Clustering by Seeding
ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Mixture Modeling with Pairwise, Instance-Level Class Constraints
Neural Computation
Penalized Probabilistic Clustering
Neural Computation
A tutorial on spectral clustering
Statistics and Computing
Constrained Clustering: Advances in Algorithms, Theory, and Applications
Constrained Clustering: Advances in Algorithms, Theory, and Applications
Semi-supervised graph clustering: a kernel approach
Machine Learning
Serendipitous learning: learning beyond the predefined label space
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Hi-index | 0.00 |
Two aspects are crucial when constructing any real world supervised classification task: the set of classes whose distinction might be useful for the domain expert, and the set of classifications that can actually be distinguished by the data. Often a set of labels is defined with some initial intuition but these are not the best match for the task. For example, labels have been assigned for land cover classification of the Earth but it has been suspected that these labels are not ideal and some classes may be best split into subclasses whereas others should be merged. This paper formalizes this problem using three ingredients: the existing class labels, the underlying separability in the data, and a special type of input from the domain expert. We require a domain expert to specify an L × L matrix of pairwise probabilistic constraints expressing their beliefs as to whether the L classes should be kept separate, merged, or split. This type of input is intuitive and easy for experts to supply. We then show that the problem can be solved by casting it as an instance of penalized probabilistic clustering (PPC). Our method, Class-Level PPC (CPPC) extends PPC showing how its time complexity can be reduced from O(N2) to O(NL) for the problem of class re-definition. We further extend the algorithm by presenting a heuristic to measure adherence to constraints, and providing a criterion for determining the model complexity (number of classes) for constraint-based clustering. We demonstrate and evaluate CPPC on artificial data and on our motivating domain of land cover classification. For the latter, an evaluation by domain experts shows that the algorithm discovers novel class definitions that are better suited to land cover classification than the original set of labels.