In evaluating the results of cluster analysis, it is common practice to make use of a number of fixed heuristics rather than to compare a data clustering directly against an empirically derived standard, such as a clustering obtained from human informants. Given the dearth of research into techniques to express the similarity between clusterings, there is broad scope for fundamental research in this area. In defining the comparative problem, we identify two types of worst-case matches between pairs of clusterings, characterised as independently codistributed clustering pairs and conjugate partition pairs. Desirable behaviour for a similarity measure in either of the two worst cases is discussed, giving rise to five test scenarios in which characteristics of one of a pair of clusterings were manipulated in order to compare and contrast the behaviour of different clustering similarity measures. This comparison is carried out for previously proposed clustering similarity measures, as well as a number of established similarity measures that have not previously been applied to clustering comparison. We introduce a paradigm apparatus for the evaluation of clustering comparison techniques and distinguish between the goodness of clusterings and the similarity of clusterings by clarifying the degree to which different measures confuse the two. Accompanying this is the proposal of a novel clustering similarity measure, the Measure of Concordance (MoC). We show that only MoC, Powers’s measure, Lopez and Rajski’s measure and various forms of Normalised Mutual Information exhibit the desired behaviour under each of the test scenarios.
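To make the comparison problem concrete, the following is a minimal sketch of one of the measures the abstract names, Normalised Mutual Information, computed from two clusterings given as parallel label sequences. The choice of normalising by the arithmetic mean of the two entropies is an assumption for illustration; the paper considers several normalisation variants, and the function name `nmi` is ours, not the paper's.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """NMI between two clusterings of the same items, given as label lists."""
    n = len(labels_a)
    pa = Counter(labels_a)                   # cluster sizes in clustering A
    pb = Counter(labels_b)                   # cluster sizes in clustering B
    pab = Counter(zip(labels_a, labels_b))   # contingency-table cell counts
    # Mutual information I(A; B) over the contingency table
    mi = sum((nij / n) * math.log((nij * n) / (pa[a] * pb[b]))
             for (a, b), nij in pab.items())
    # Entropies H(A) and H(B)
    ha = -sum((c / n) * math.log(c / n) for c in pa.values())
    hb = -sum((c / n) * math.log(c / n) for c in pb.values())
    if ha == 0 and hb == 0:                  # both clusterings are trivial
        return 1.0
    return 2 * mi / (ha + hb)                # normalise by the mean entropy

# Identical clusterings (up to relabelling) score 1.0:
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# Independently codistributed clusterings score 0.0:
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
```

Note that the second call illustrates the first worst case discussed above: when the two clusterings are statistically independent, the mutual information, and hence NMI, is zero regardless of how "good" either clustering is in isolation.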