In computational linguistics, a reliability measurement of 0.8 on some statistic such as κ is widely thought to guarantee that hand-coded data is fit for purpose, with values between 0.67 and 0.8 tolerable and lower values suspect. We demonstrate that the main use of such data, machine learning, can tolerate data with low reliability as long as any disagreement among human coders looks like random noise. When the disagreement introduces patterns, however, the machine learner can pick these up just as it picks up the real patterns in the data, making performance figures look better than they really are. For the range of reliability measures that the field currently accepts, disagreement can appreciably inflate performance figures, and even a measure of 0.8 does not guarantee that what looks like good performance really is. Although this is a commonsense result, it has implications for how we work. At the very least, computational linguists should look for patterns in the disagreement among coders and assess what impact they will have.
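The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's own experiment; the synthetic data, the 15% disagreement rate, the noise models, and the k-nearest-neighbour learner are all assumptions made here for illustration. It compares a coder whose disagreement with the true categories is random noise against a coder whose disagreement follows a systematic pattern tied to the items, and scores the learner both against the coder's labels (the usual evaluation) and against the true categories.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Two Gaussian clusters in 2-D; the cluster id is the "true" category.
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(n, 2))
    return x, y

def random_disagreement(y, rate):
    # Unpatterned disagreement: flip a fraction `rate` of labels at random.
    flip = rng.random(len(y)) < rate
    return np.where(flip, 1 - y, y)

def patterned_disagreement(x, y, rate):
    # Patterned disagreement: systematically flip the labels of the items
    # with the largest second feature (disagreement tied to item properties).
    k = int(rate * len(y))
    idx = np.argsort(-x[:, 1])[:k]
    y_coder = y.copy()
    y_coder[idx] = 1 - y_coder[idx]
    return y_coder

def knn_predict(x_train, y_train, x_test, k=5):
    # Plain k-nearest-neighbour classification.
    d = ((x_test[:, None, :] - x_train[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

x, y_true = make_data(2000)
train, test = slice(0, 1000), slice(1000, 2000)

coders = {
    "random disagreement":    random_disagreement(y_true, 0.15),
    "patterned disagreement": patterned_disagreement(x, y_true, 0.15),
}

for name, y_coder in coders.items():
    pred = knn_predict(x[train], y_coder[train], x[test])
    apparent = (pred == y_coder[test]).mean()  # scored against the coder's labels
    actual = (pred == y_true[test]).mean()     # scored against the true categories
    print(f"{name:24s} apparent accuracy {apparent:.2f}   true accuracy {actual:.2f}")
```

Under these assumptions, the random-noise coder produces apparent figures no better than the true ones, whereas the patterned coder's systematic flips are learnable, so the apparent figures overstate accuracy against the true categories, which is the inflation the abstract warns about.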