Reliability measurement without limits

  • Authors: Dennis Reidsma; Jean Carletta

  • Venue: Computational Linguistics
  • Year: 2008

Abstract

In computational linguistics, a reliability measurement of 0.8 on some statistic such as κ is widely thought to guarantee that hand-coded data is fit for purpose, with 0.67 to 0.8 tolerable, and lower values suspect. We demonstrate that the main use of such data, machine learning, can tolerate data with low reliability as long as any disagreement among human coders looks like random noise. When the disagreement introduces patterns, however, the machine learner can pick these up just like it picks up the real patterns in the data, making the performance figures look better than they really are. For the range of reliability measures that the field currently accepts, disagreement can appreciably inflate performance figures, and even a measure of 0.8 does not guarantee that what looks like good performance really is. Although this is a commonsense result, it has implications for how we work. At the very least, computational linguists should look for any patterns in the disagreement among coders and assess what impact they will have.
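To make the argument concrete, below is a minimal, self-contained sketch (not taken from the paper; the binary features, the exaggerated disagreement rates, and the toy majority-per-cell learner are illustrative assumptions). It computes Cohen's κ and compares two annotation scenarios with the same overall disagreement rate: one where the simulated coder's errors are random, and one where they follow a pattern visible in the features. Here κ is measured between the single simulated coder and the ground truth, as a stand-in for agreement between coders. A learner trained on the patterned data reproduces the coder's bias, so its apparent accuracy against coder-labelled test data looks higher even though its accuracy against the truth is lower.

```python
import random
from collections import Counter, defaultdict


def cohens_kappa(a, b):
    """Cohen's kappa between two label sequences of equal length."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    ma, mb = Counter(a), Counter(b)
    p_e = sum((ma[l] / n) * (mb[l] / n) for l in set(ma) | set(mb))  # chance agreement
    return (p_o - p_e) / (1 - p_e)


def make_data(n, noise, rng):
    """Each item has two binary features; the true label equals feature f1.
    noise='random':    the coder flips the label on 25% of items at random.
    noise='patterned': the coder always flips the label when f1=0 and f2=1
                       (also ~25% of items, but predictable from the features)."""
    items, truth, coder = [], [], []
    for _ in range(n):
        f1, f2 = rng.randint(0, 1), rng.randint(0, 1)
        t = f1
        if noise == "random":
            c = 1 - t if rng.random() < 0.25 else t
        else:
            c = 1 - t if (f1 == 0 and f2 == 1) else t
        items.append((f1, f2))
        truth.append(t)
        coder.append(c)
    return items, truth, coder


def train_majority(items, labels):
    """Toy learner: memorise the majority training label in each feature cell."""
    votes = defaultdict(Counter)
    for x, y in zip(items, labels):
        votes[x][y] += 1
    return {x: counts.most_common(1)[0][0] for x, counts in votes.items()}


def accuracy(model, items, labels):
    return sum(model[x] == y for x, y in zip(items, labels)) / len(items)


rng = random.Random(0)
for noise in ("random", "patterned"):
    items, truth, coder = make_data(20000, noise, rng)
    half = len(items) // 2
    model = train_majority(items[:half], coder[:half])
    print(f"{noise:9s}  kappa(coder, truth) = {cohens_kappa(coder, truth):.2f}"
          f"  apparent accuracy vs coder = {accuracy(model, items[half:], coder[half:]):.2f}"
          f"  accuracy vs truth = {accuracy(model, items[half:], truth[half:]):.2f}")
```

Under these assumptions both scenarios give roughly the same κ (about 0.5 here, since a quarter of the labels disagree with the truth), but the random case yields apparent accuracy near 0.75 with true accuracy near 1.0, while the patterned case yields apparent accuracy near 1.0 with true accuracy near 0.75: the same agreement statistic, yet the patterned disagreement inflates the measured performance.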