The Problem with Kappa

  • Author: David M. W. Powers
  • Affiliation: CSEM, Flinders University
  • Venue: EACL '12: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
  • Year: 2012

Abstract

It is becoming clear that traditional evaluation measures used in Computational Linguistics (including Error Rates, Accuracy, Recall, Precision and F-measure) are of limited value for unbiased evaluation of systems, and are not meaningful for comparison of algorithms unless both the dataset and algorithm parameters are strictly controlled for skew (Prevalence and Bias). The use of techniques originally designed for other purposes, in particular Receiver Operating Characteristic Area Under the Curve (ROC AUC), plus variants of Kappa, has been proposed to fill the void. This paper aims to clear up some of the confusion relating to evaluation by demonstrating that the usefulness of each evaluation method is highly dependent on the assumptions made about the distributions of the dataset and the underlying populations. The behaviour of a number of evaluation measures is compared under common assumptions. Deploying a system in a context which has the opposite skew from its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa, but to leave Powers Kappa unchanged. For most performance evaluation purposes the latter is thus most appropriate, whilst for comparison of behaviour Matthews Correlation is recommended.
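
The abstract's central claim, that chance-corrected measures react very differently when the Prevalence of the deployment set differs from that of the validation set, can be checked directly on a 2x2 confusion matrix. The following is a minimal sketch, not taken from the paper: the confusion-matrix counts are hypothetical, "Powers Kappa" is computed here as Informedness (Recall + Inverse Recall - 1, equivalently TPR - FPR), and Fleiss Kappa for two raters is computed as Scott's Pi using pooled marginals. Holding the classifier's per-class recall fixed while reversing the class skew leaves Informedness unchanged but shifts both Kappas.

```python
# Minimal sketch (hypothetical counts, not from the paper): compare how the
# measures discussed in the abstract react when class skew is reversed
# while the classifier's per-class recall (TPR = 0.8, TNR = 0.6) is fixed.
import math

def measures(tp, fn, fp, tn):
    n = tp + fn + fp + tn
    po = (tp + tn) / n  # observed agreement (Accuracy)

    # Cohen Kappa: expected agreement from the product of the two marginals.
    pe_c = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    cohen = (po - pe_c) / (1 - pe_c)

    # Fleiss Kappa for two raters (Scott's Pi): expected agreement from the
    # pooled (averaged) marginals.
    p_pos = ((tp + fn) + (tp + fp)) / (2 * n)
    pe_f = p_pos ** 2 + (1 - p_pos) ** 2
    fleiss = (po - pe_f) / (1 - pe_f)

    # Powers Kappa read as Informedness: Recall + Inverse Recall - 1,
    # i.e. TPR - FPR (an interpretation assumed by this sketch).
    informedness = tp / (tp + fn) - fp / (fp + tn)

    # Matthews Correlation Coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return cohen, fleiss, informedness, mcc

# Validation set: 90% positive; deployment set: the opposite skew (10%
# positive), with the same per-class recall rates.
cases = {
    "validation (90% pos)": (72, 18, 4, 6),   # tp, fn, fp, tn
    "deployment (10% pos)": (8, 2, 36, 54),
}
for name, cm in cases.items():
    c, f, i, m = measures(*cm)
    print(f"{name}: Cohen={c:.3f} Fleiss={f:.3f} "
          f"Informedness={i:.3f} MCC={m:.3f}")
```

With these counts the sketch prints Informedness = 0.400 under both skews, while Cohen Kappa falls from about 0.247 to 0.159 and Fleiss Kappa from about 0.220 to 0.036. This does not reproduce the paper's exact construction, but it shows the direction of the effect the abstract describes: the skew-insensitive measure is stable across deployment contexts, the marginal-dependent Kappas are not.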