Redefining class definitions using constraint-based clustering: an application to remote sensing of the earth's surface

Authors:
Dan R. Preston;Carla E. Brodley;Roni Khardon;Damien Sulla-Menashe;Mark Friedl
Affiliations:
Tufts University, Medford, MA, USA;Tufts University, Medford, MA, USA;Tufts University, Medford, MA, USA;Boston University, Boston, MA, USA;Boston University, Boston, MA, USA
Venue:
Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2010

Citing 9

From data mining to knowledge discovery: an overview. Advances in knowledge discovery and data mining
Split and Merge EM Algorithm for Improving Gaussian Mixture Density Estimates. Journal of VLSI Signal Processing Systems
Constrained K-means Clustering with Background Knowledge. ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Semi-supervised Clustering by Seeding. ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Mixture Modeling with Pairwise, Instance-Level Class Constraints. Neural Computation
Penalized Probabilistic Clustering. Neural Computation
A tutorial on spectral clustering. Statistics and Computing
Constrained Clustering: Advances in Algorithms, Theory, and Applications
Semi-supervised graph clustering: a kernel approach. Machine Learning

Cited 1

Serendipitous learning: learning beyond the predefined label space. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

Abstract

Two aspects are crucial when constructing any real-world supervised classification task: the set of classes whose distinction might be useful for the domain expert, and the set of classifications that can actually be distinguished by the data. Often a set of labels is defined from some initial intuition, but these labels are not the best match for the task. For example, labels have been assigned for land cover classification of the Earth, but it has been suspected that these labels are not ideal: some classes may be best split into subclasses, whereas others should be merged. This paper formalizes this problem using three ingredients: the existing class labels, the underlying separability in the data, and a special type of input from the domain expert. We require a domain expert to specify an L × L matrix of pairwise probabilistic constraints expressing their beliefs as to whether the L classes should be kept separate, merged, or split. This type of input is intuitive and easy for experts to supply. We then show that the problem can be solved by casting it as an instance of penalized probabilistic clustering (PPC). Our method, Class-Level PPC (CPPC), extends PPC, showing how its time complexity can be reduced from O(N^2) to O(NL) for the problem of class re-definition. We further extend the algorithm by presenting a heuristic to measure adherence to constraints and by providing a criterion for determining the model complexity (number of classes) for constraint-based clustering. We demonstrate and evaluate CPPC on artificial data and on our motivating domain of land cover classification. For the latter, an evaluation by domain experts shows that the algorithm discovers novel class definitions that are better suited to land cover classification than the original set of labels.
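
As a rough illustration of why constraints stated between classes can be cheaper to work with than constraints stated between individual instances, the sketch below computes the same pairwise agreement score in two ways: by looping over all N^2 instance pairs, and by first summing soft cluster assignments within each class. This is not the CPPC algorithm and does not reproduce the paper's O(NL) derivation. The function names, the toy labels, the soft assignments q, and the weight matrix W are all assumptions made for this example, and the score includes the i = j pairs for simplicity.

```python
# Illustrative sketch only; NOT the CPPC algorithm from the paper.
# All names and numbers below are hypothetical.

def pairwise_score_naive(labels, q, W):
    """Sum W[y_i][y_j] * <q_i, q_j> over all ordered pairs (i, j).

    labels: class index of each instance (length N)
    q:      soft cluster assignments, one length-K list per instance
    W:      L x L matrix of class-level constraint weights
    Cost grows with N^2 because every instance pair is visited.
    """
    total = 0.0
    for i, yi in enumerate(labels):
        for j, yj in enumerate(labels):
            agree = sum(a * b for a, b in zip(q[i], q[j]))
            total += W[yi][yj] * agree
    return total


def pairwise_score_by_class(labels, q, W):
    """Same quantity, computed from per-class sums of soft assignments.

    Because the weight of a pair depends only on the two class labels,
    instances can be aggregated by class first, so no instance pair is
    ever enumerated.
    """
    num_classes = len(W)
    num_clusters = len(q[0])
    # s[l][k] = total soft assignment to cluster k from instances labeled l
    s = [[0.0] * num_clusters for _ in range(num_classes)]
    for yi, qi in zip(labels, q):
        for k in range(num_clusters):
            s[yi][k] += qi[k]
    total = 0.0
    for l in range(num_classes):
        for m in range(num_classes):
            total += W[l][m] * sum(s[l][k] * s[m][k] for k in range(num_clusters))
    return total


# Toy data: 5 instances, 3 original classes, 2 clusters.
labels = [0, 0, 1, 2, 2]
q = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.5, 0.5], [0.1, 0.9]]
W = [[1.0, -0.5, 0.0], [-0.5, 1.0, 0.2], [0.0, 0.2, -1.0]]

naive = pairwise_score_naive(labels, q, W)
by_class = pairwise_score_by_class(labels, q, W)
```

The two functions agree up to floating-point rounding because the weight attached to a pair depends only on the pair's class labels, so the double sum over instances factors into a double sum over classes.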